Parallel Efficiency Calculation Method and Apparatus 

TECHNICAL FIELD OF THE INVENTION 

The present invention relates to a technique for evaluating the
performance of a parallel computer system and to a technique for using
the result of the performance evaluation. Incidentally, the technique of
the invention can be applied to all fields in which parallel processing
is performed, such as the fields handled in conventional high
performance computing (HPC) (structural analysis, fluid analysis,
computational chemistry, etc.), biosimulation deployed on a grid or a
cluster, or Web services (for example, MtoM (Machine to Machine)).

BACKGROUND OF THE INVENTION 

The performance of a parallel computer system varies remarkably from
application to application. Accordingly, its performance evaluation is
important. Performance evaluation methods for a parallel computer system
include (1) a method in which a specific processing is executed by
various computers and a comparison is made, and (2) a self-complete type
method that evaluates how much performance a certain computer
demonstrates as compared with its own potential. The former is mainly
used as a benchmark test for performance comparison among computers.
The latter needs to be executed in practical use after introduction.
Although the self-complete type performance evaluation can be carried
out by using an index called a parallel efficiency, it has not actually
been executed. Instead of calculating the parallel efficiency, a
parallel performance evaluation (so-called scalability evaluation) can
also be made, in which time measurements are carried out while the
number p of processors is changed and the degree of decrease of the
processing time is compared with the ideal decrease of 1/p. However,
since measurements must be made several times, this evaluation is
generally not made either. Besides, the scalability evaluation is
qualitative, and a strict parallel performance evaluation cannot be
performed with it. Accordingly, at present, processings with poor
parallel efficiency cannot be detected, and they are left unmanaged.

The performance evaluation of a parallel processing by using the
parallel efficiency is performed by calculating a parallel efficiency
Ep(p) determined by expressions (1) and (2) set forth below, where p
is the number of processors, τ(1) is a processing time in a case where
a processing is executed by one processor, τ(p) is a processing time
in a case where the same processing is executed by p processors, and
τi(p) is a processing time of an i-th processor under 1 ≤ i ≤ p.

\[ E_p(p) = \frac{\tau(1)}{\tau(p) \cdot p} \tag{1} \]

\[ \tau(p) = \max_{i=1}^{p} \{ \tau_i(p) \} \tag{2} \]
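
As a minimal sketch of expressions (1) and (2), the parallel efficiency
can be computed from one single-processor timing and the per-processor
timings of a parallel run. Here the one-processor time is written tau_1
and the per-processor times are written tau_i; the function name and the
sample timings are hypothetical and are not taken from the specification.

```python
def parallel_efficiency(tau_1, tau_i):
    """Return Ep(p) = tau(1) / (p * tau(p)) for one parallel run.

    tau_1 : processing time when the processing runs on one processor
    tau_i : list of processing times of the p processors
    """
    p = len(tau_i)
    tau_p = max(tau_i)            # expression (2): the slowest processor
    return tau_1 / (p * tau_p)    # expression (1)


# Hypothetical timings in seconds for a four-processor run.
tau_1 = 100.0
tau_i = [30.0, 28.0, 32.0, 26.0]
print(f"Ep(4) = {parallel_efficiency(tau_1, tau_i):.3f}")
```

With these sample values the slowest processor needs 32 s, so
Ep(4) = 100 / (4 x 32), roughly 0.78.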

The expression (1) is disclosed in, for example, a document
"PERFORMANCE EVALUATION OF GeoFEM OF PARALLEL FINITE ELEMENT METHOD CODE, 
Transactions of JSCES, No. 20000022 (2000) by Tsukaya, Nakamura, Okuda, 
and Yagawa". 

However, even if the parallel efficiency is determined by the
conventional method, since the quantitative relation to parallel
performance impediment factors is not clear, it has not been understood
which impediment factor has what influence on the parallel efficiency.
Besides, a certain parallel performance evaluation technique (for
example, Japanese Patent Application No. 2001-241121 and US
Publication No. US-2003-0036884-A1) requires, as shown in Fig. 1, the
condition that "load balance is kept", that is, that the respective
processing times γi (parallel part), χi,1 (redundancy processing part),
χi,2 (communication part), and χi,others (other parallel performance
impediment factors) are identical to one another for all "i". There has
therefore been a problem that the technique can be applied only to
certain special parallel processings.

Besides, it is difficult to apply the conventional methods to
parallel processing on a grid or a cluster. This is because, when the
resources distributed on the grid or cluster and required for
calculation, such as memories, data and CPUs, are concentrated in one
processor, the processing often becomes so large that it cannot be
accomplished by that one processor. That is, it is difficult to measure
τ(1) itself. Besides, obtaining τ(1) and τ(p) in the expression (1) by
actual measurement presupposes that the performances of the processors
are identical to one another. However, since the respective processor
performances on the grid or the cluster are generally different from
one another, there is also a problem that even if the actually measured
τ(1) and τ(p) are substituted into the expression (1), an accurate
parallel efficiency cannot be determined.

SUMMARY OF THE INVENTION 

An object of the invention is therefore to provide a parallel
processing performance evaluation technique from which the condition
that "load balance is kept" is removed, which can be applied to many
kinds of parallel processing including a heterogeneous computer system
environment, and which quantitatively correlates the parallel efficiency
with parallel performance evaluation indexes and parallel performance
impediment factors.

Another object of the invention is to provide a technique for
enabling suitable use of a parallel computer system by using the parallel
efficiency and the like.

Still another object of the invention is to provide a technique
for enabling a suitable judgment on capability increase, renewal, or
the like of a parallel computer system by using a parallel efficiency
and the like.

Still another object of the invention is to provide a technique
for enabling the suitable execution of tuning and/or selection of an
algorithm of a program executed in a parallel computer system.

According to a first aspect of the invention, a parallel
efficiency calculation method for computing a parallel efficiency of
a parallel computer system comprises the steps of: calculating a load
balance contribution ratio Rb(p) representing a load balance degree
between respective processors included in the parallel computer system
and storing it into a storage device; calculating a virtual
parallelization ratio Rp(p) representing a ratio, with respect to time,
of a portion processed in parallel by the respective processors among
processings executed in the parallel computer system and storing it into
the storage device; calculating a parallel performance impediment
factor contribution ratio Rj(p) representing a ratio of a processing
time of a processing portion corresponding to each parallel performance
impediment factor to a total processing time of all the processors
included in the parallel computer system and storing it into the storage
device; and calculating the parallel efficiency by using the load
balance contribution ratio Rb(p), the virtual parallelization ratio
Rp(p), and the parallel performance impediment factor contribution
ratio Rj(p) (for example, in accordance with an expression (4-4) in an
embodiment) and storing it into the storage device.

By this, the parallel efficiency is quantitatively correlated
with the parallel performance evaluation indexes such as the load balance
contribution ratio, the virtual parallelization ratio, and the parallel
performance impediment factor contribution ratio. The parallel
efficiency may be outputted to an output device such as a display device
with at least any of the load balance contribution ratio Rb(p), the
virtual parallelization ratio Rp(p), and the parallel performance
impediment factor contribution ratio Rj(p).
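
Expression (4-4) is not reproduced in this summary, so the sketch below
uses a relation derived only from the definitions given here: Rb(p) as
the total processing time divided by p times the longest processing
time, Rp(p) as the parallel-part total divided by the one-processor
time, and Rj(p) as the factor-j total divided by the total processing
time. Substituting these into Ep(p) = (one-processor time) / (p x longest
time) gives Ep(p) = Rb(p) / Rp(p) x (1 - sum of Rj(p)). Treating this as
the form of expression (4-4) is an assumption, and the function name and
index values are hypothetical.

```python
def efficiency_from_indexes(r_b, r_p, r_j):
    """Combine the three evaluation indexes into a parallel efficiency.

    Assumed relation (derived from the definitions, not quoted from the
    specification):
        Ep(p) = Rb(p) / Rp(p) * (1 - sum_j Rj(p))
    """
    return r_b / r_p * (1.0 - sum(r_j))


# Hypothetical index values for one parallel run.
r_b = 0.9            # load balance contribution ratio Rb(p)
r_p = 0.96           # virtual parallelization ratio Rp(p)
r_j = [0.05, 0.11]   # Rj(p) for two impediment factors

print(f"Ep(p) = {efficiency_from_indexes(r_b, r_p, r_j):.4f}")
```

Written this way, each index shows separately how much it lowers the
efficiency: a poor load balance lowers Rb(p), and every impediment
factor j removes its own share Rj(p).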
According to a second aspect of the invention, a parallel
efficiency calculation method for calculating a parallel efficiency of
a parallel computer system comprises the steps of: calculating a load
balance contribution ratio Rb(p) representing a load balance degree
between respective processors included in the parallel computer system
and storing it into a storage device; calculating an acceleration ratio
Ap(p) representing a limit of improvement in a shortening degree of a
processing time by parallelization of a processing executed in the
parallel computer system and storing it into the storage device;
calculating a parallel performance impediment factor contribution ratio
Rj(p) representing a ratio of a processing time of a processing portion
corresponding to each parallel performance impediment factor to a total
processing time of all the processors included in the parallel computer
system and storing it into the storage device; and calculating the
parallel efficiency by using the load balance contribution ratio Rb(p),
the acceleration ratio Ap(p), and the parallel performance impediment
factor contribution ratio Rj(p) (for example, in accordance with an
expression (4-5) in an embodiment) and storing it into the storage
device.

By this, the parallel efficiency is quantitatively correlated
with the parallel performance evaluation indexes, such as the load balance
contribution ratio and the parallel performance impediment factor
contribution ratio, and an auxiliary index such as the acceleration
ratio. The parallel efficiency may be outputted to an output device such
as a display device with at least any of the load balance contribution
ratio Rb(p), the acceleration ratio Ap(p), and the parallel performance
impediment factor contribution ratio Rj(p).
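
A possible reading of the second aspect is sketched below. It assumes
that the acceleration ratio is the reciprocal of one minus the virtual
parallelization ratio, as this summary states for the acceleration ratio
calculating step, so that 1 / Rp(p) can be rewritten as
Ap(p) / (Ap(p) - 1). Expression (4-5) is not reproduced here, so the
resulting formula is an assumption derived from the definitions, and the
function name and values are hypothetical.

```python
def efficiency_from_acceleration(r_b, a_p, r_j):
    """Parallel efficiency from Rb(p), Ap(p) and the Rj(p) values.

    Assumed relation: Ep(p) = Rb(p) * Ap(p) / (Ap(p) - 1) * (1 - sum_j Rj(p)),
    obtained by substituting Rp(p) = 1 - 1/Ap(p) into Rb/Rp * (1 - sum Rj).
    Requires a_p > 1.
    """
    return r_b * a_p / (a_p - 1.0) * (1.0 - sum(r_j))


# Hypothetical values; Ap(p) = 25 corresponds to Rp(p) = 0.96.
print(f"Ep(p) = {efficiency_from_acceleration(0.9, 25.0, [0.05, 0.11]):.4f}")
```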

According to a third aspect of the invention, a parallel
efficiency calculation method for calculating a parallel efficiency of
a parallel computer system comprises the steps of: calculating a load
balance contribution ratio Rb(p) representing a load balance degree
between respective processors included in the parallel computer system
and storing it into a storage device; calculating a parallel performance
impediment factor contribution ratio Rj(p) representing a ratio of a
processing time of a processing portion corresponding to each parallel
performance impediment factor to a total processing time of all the
processors included in the parallel computer system and storing it into
the storage device; and calculating the parallel efficiency by using
the load balance contribution ratio Rb(p) and the parallel performance
impediment factor contribution ratio Rj(p) (for example, in accordance
with an expression (8-2) in an embodiment) and storing it into the
storage device.

For example, if the sum of processing times of portions processed
in parallel in the processings executed by the respective processors
included in the parallel computer system is almost identical to a
processing time in the case where the same processing is executed by
one processor, that is, if almost all of the processing content can be
processed in parallel, the parallel efficiency can be calculated in
this way. The parallel efficiency may be outputted to an output device
such as a display device with at least any of the load balance
contribution ratio Rb(p) and the parallel performance impediment factor
contribution ratio Rj(p).

According to a fourth aspect of the invention, a parallel
efficiency calculation method for calculating a parallel efficiency of
a parallel computer system comprises the steps of: calculating a load
balance contribution ratio Rb(p) representing a load balance degree
between respective processors included in the parallel computer system
and storing it into a storage device; calculating a virtual
parallelization ratio Rp(p) representing a ratio, with respect to time,
of a portion processed in parallel by the respective processors among
processings executed in the parallel computer system and storing it into
the storage device; and calculating the parallel efficiency by using
a sum of processing times of portions processed in parallel in
processings executed in the respective processors included in the
parallel computer system, a sum of processing times of the processings
executed in the respective processors, the load balance contribution
ratio Rb(p), and the virtual parallelization ratio Rp(p) (for example,
in accordance with an expression (9-1) in an embodiment) and storing
it into the storage device. This is a modified example of the first
aspect of the invention. The parallel efficiency may be outputted to
an output device such as a display device with at least any of the sum
of processing times of portions processed in parallel in processings
executed in the respective processors, the sum of processing times of
the processings executed in the respective processors, the load balance
contribution ratio Rb(p), and the virtual parallelization ratio Rp(p).

According to a fifth aspect of the invention, a parallel
efficiency calculation method for calculating a parallel efficiency of
a parallel computer system comprises the steps of: calculating a first
processing time equivalent to a total processing time of parallel
performance impediment portions of a processing in a case where the
processing is executed by one processor and storing it into a storage
device; calculating a second processing time as a sum of processing times
of portions processed in parallel in processings executed in respective
processors included in the parallel computer system; and calculating
the parallel efficiency by using the number of the processors used in
the parallel computer system, a longest processing time in processing
times of the processings executed in the respective processors included
in the parallel computer system, the first processing time, and the
second processing time (for example, in accordance with an expression
(9-2) in an embodiment) and storing it into the storage device.
On the basis of predetermined modeling, the parallel efficiency
can be computed from only the processing times obtained by one
measurement.
The parallel efficiency may be outputted to an output device such
as a display device with at least any of the number of the processors
used in the parallel computer system, the longest processing time in
processing times of the processings executed in the respective
processors included in the parallel computer system, the first
processing time, and the second processing time.
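
The fifth aspect can be sketched as follows. The sketch assumes that the
one-processor processing time is modeled as the sum of the first and
second processing times, which this summary states for the third
processing time, and that this modeled time is then divided by the number
of processors times the longest processing time. Expression (9-2) is not
reproduced here, so this form is an assumption; the function name and the
sample values are hypothetical.

```python
def efficiency_single_measurement(first_time, second_time, p, longest_time):
    """Parallel efficiency from the four quantities named in the fifth aspect.

    first_time   : modeled one-processor time of the impediment portions
    second_time  : sum over processors of the parallel-part times
    p            : number of processors used
    longest_time : longest per-processor processing time
    """
    tau_1_model = first_time + second_time   # assumed model of the one-processor time
    return tau_1_model / (p * longest_time)


# Hypothetical values from one measured four-processor run.
print(efficiency_single_measurement(4.0, 96.0, 4, 32.0))
```

No separate one-processor run is needed in this sketch, which is the
point of obtaining the efficiency from a single measurement.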

Besides, in the aforementioned load balance contribution ratio
calculating step, the load balance contribution ratio Rb(p) may be
calculated by dividing the total processing time of the processings
executed in all the processors included in the parallel computer system
by the product of a longest processing time in the processing times of
the processings executed in the respective processors included in the
parallel computer system and the number of the processors used in the
parallel computer system (for example, in accordance with an expression
(5) in an embodiment).
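
The load balance contribution ratio described above can be sketched as
follows. The function name and the timings are hypothetical; only the
division itself (total time over longest time times processor count) is
taken from the text.

```python
def load_balance_contribution_ratio(tau_i):
    """Rb(p) = (sum of per-processor times) / (longest time * p)."""
    p = len(tau_i)
    return sum(tau_i) / (max(tau_i) * p)


# Hypothetical per-processor processing times in seconds.
print(load_balance_contribution_ratio([30.0, 28.0, 32.0, 26.0]))   # mildly unbalanced
print(load_balance_contribution_ratio([10.0, 10.0, 10.0, 10.0]))   # perfectly balanced
print(load_balance_contribution_ratio([40.0, 0.0, 0.0, 0.0]))      # one of four processors works
```

The ratio is 1 when all processors take equal time and falls to 1/p
when a single processor does all the work.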

Further, in the aforementioned virtual parallelization ratio
calculating step, the virtual parallelization ratio Rp(p) may be
calculated by dividing a sum of the processing times of the portions
processed in parallel in the processings executed in the respective
processors included in the parallel computer system by a processing time
equivalent to a third processing time in a case where the same processing
is executed by one processor (for example, in accordance with an
expression (6-1) in an embodiment).
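
A minimal sketch of this ratio follows. The parallel-part times of the
processors are passed as a list and the third processing time (the
one-processor time) as a number; the names and values are hypothetical.

```python
def virtual_parallelization_ratio(gamma_i, tau_1):
    """Rp(p) = (sum of parallel-part times over processors) / (one-processor time)."""
    return sum(gamma_i) / tau_1


# Hypothetical values: each of four processors spends 24 s in the parallel part,
# and the same processing would take 100 s on one processor.
print(virtual_parallelization_ratio([24.0, 24.0, 24.0, 24.0], 100.0))
```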

Besides, in the aforementioned parallel performance impediment
factor contribution ratio calculating step, the parallel performance
impediment factor contribution ratio Rj(p) concerning a specific
parallel performance impediment factor may be calculated by dividing
a sum of the processing times of processing portions corresponding to
the specific parallel performance impediment factor in the respective
processors included in the parallel computer system by a sum of the
processing times of the respective processors included in the parallel
computer system (for example, in accordance with an expression (7) in
an embodiment).
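
The per-factor ratio can be sketched as below. The sketch assumes that
each processor's total time is its parallel-part time plus the times of
its impediment factors; the data layout (one row of factor times per
processor), the names, and the values are hypothetical.

```python
def impediment_contribution_ratios(gamma_i, chi):
    """Return [Rj(p) for each factor j].

    gamma_i : parallel-part time of each processor
    chi     : chi[i][j] is the time processor i spends on impediment factor j
    """
    # Per-processor total time, assumed to be parallel part plus all impediment parts.
    tau_i = [g + sum(row) for g, row in zip(gamma_i, chi)]
    total = sum(tau_i)
    n_factors = len(chi[0])
    return [sum(row[j] for row in chi) / total for j in range(n_factors)]


# Hypothetical run: factor 0 is redundancy processing, factor 1 is communication.
gamma_i = [24.0, 24.0, 24.0, 24.0]
chi = [[2.0, 4.0], [1.0, 3.0], [3.0, 5.0], [1.0, 1.0]]
for j, r in enumerate(impediment_contribution_ratios(gamma_i, chi)):
    print(f"R{j}(p) = {r:.4f}")
```

The factor with the largest ratio is the one that consumes the largest
share of the total processor time.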

Besides, in the aforementioned acceleration ratio calculating
step, the acceleration ratio may be calculated as the reciprocal of the
value obtained by subtracting the virtual parallelization ratio from 1,
where the virtual parallelization ratio is calculated by dividing a sum
of processing times of portions processed in parallel in processings
executed in the respective processors included in the parallel computer
system by a processing time equivalent to a third processing time in a
case where the same processing is executed by one processor (for
example, in accordance with an expression (6-2) in an embodiment).
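
The acceleration ratio can be sketched as below. It has the same form as
the upper bound on speed-up in Amdahl's law, with the virtual
parallelization ratio in the role of the parallel fraction. Returning
infinity when the ratio equals 1 is an assumption made here to avoid a
division by zero; the specification does not say how that case is
handled.

```python
import math


def acceleration_ratio(r_p):
    """Ap(p) = 1 / (1 - Rp(p))."""
    if r_p == 1.0:
        return math.inf   # assumption: fully parallel processing has no finite limit
    return 1.0 / (1.0 - r_p)


# Hypothetical virtual parallelization ratios.
for r_p in (0.5, 0.9, 0.96):
    print(f"Rp = {r_p:.2f} -> Ap = {acceleration_ratio(r_p):.1f}")
```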

Further, the aforementioned processing time may be represented not
only by an actual processing time but also by the number of times
corresponding events are confirmed.

Besides, a step of calculating an auxiliary index by multiplying
the calculated parallel efficiency by the number of processors used in
the parallel computer system and storing it into the storage device may
be further included. By this, it becomes possible to indicate how many
processors the processing executed in the parallel computer system
effectively corresponds to.

Further, the aforementioned third processing time may be
calculated as a sum of a first processing time equivalent to a total
processing time of portions corresponding to parallel performance
impediment factors in a processing in a case where the processing is
executed by one processor and a second processing time as a sum of
processing times of portions processed in parallel in the processings
executed in the respective processors included in the parallel computer
system (for example, in accordance with an expression (15) in an
embodiment). By predetermined modeling, it becomes possible to calculate
the parallel efficiency etc. from one measurement of the processing
time.

Further, the aforementioned first processing time may be
calculated as a sum of processing times of redundancy processings or
communication processings in the processings executed in the respective
processors included in the parallel computer system (for example, in
accordance with an expression (12-1) in an embodiment).

Besides, the first to fifth aspects of the invention may further
comprise the steps of: setting a target parallel efficiency; and
calculating an optimum number of processors by dividing a product of
the calculated parallel efficiency and the number of processors by the
target parallel efficiency and storing it into the storage device. Even
if many processors are added, the processing time is not necessarily
shortened; if the optimum number of processors can be computed as
stated above, it is possible to prevent wasteful addition of resources.
The optimum number of processors may be outputted to an output device
such as a display device.
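
The optimum number of processors described above reduces to one
division, sketched below. The function name and the sample values are
hypothetical, and the result is left as a floating-point number because
the text does not state how it should be rounded.

```python
def optimum_processor_count(e_p, p, e_target):
    """(calculated parallel efficiency * number of processors) / target efficiency."""
    return e_p * p / e_target


# Hypothetical case: 16 processors ran at an efficiency of 0.5; the target is 0.8.
e_p, p, e_target = 0.5, 16, 0.8
print(f"effective processors: {e_p * p:.1f}")   # auxiliary index Ep(p) * p
print(f"optimum processors:   {optimum_processor_count(e_p, p, e_target):.1f}")
```

In this example the run does the work of about 8 fully used processors,
so about 10 processors would suffice at the target efficiency.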

Further, the first to fifth aspects of the invention may further
comprise the steps of: setting an increase of a working time at a time
of system expansion and a predicted parallel efficiency; and calculating
an acceleration ratio at the time of the system expansion and storing it
into the storage device. The acceleration ratio is calculated by
dividing the sum of two terms by a sum of working times of the
respective processors presently included in the parallel computer system
(for example, in accordance with an expression (18) in an embodiment).
The first term is a product sum, with respect to all the processings, of
a sum of processing times of the processings executed in the respective
processors presently included in the parallel computer system and the
calculated parallel efficiency. The second term is a product of the
increase of the working time and the predicted parallel efficiency. At
the time of the system expansion, it becomes possible to give suitable
quantitative guidance to a system manager. The acceleration ratio at
the time of the system expansion may be outputted to an output device
such as a display device.
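
The calculation described for the time of system expansion can be
sketched as below, following the wording of this paragraph term by term.
Expression (18) is not reproduced here, so the sketch is one reading of
the text; the function name, the representation of each processing as a
pair (sum of processor times, parallel efficiency), and all values are
hypothetical.

```python
def expansion_acceleration_ratio(jobs, working_times, added_working_time,
                                 predicted_efficiency):
    """Acceleration ratio at the time of system expansion (assumed reading).

    jobs          : list of (sum of per-processor processing times, Ep) per processing
    working_times : working time of each processor presently in the system
    """
    present = sum(total_time * e_p for total_time, e_p in jobs)   # product sum
    added = added_working_time * predicted_efficiency             # expansion term
    return (present + added) / sum(working_times)


# Hypothetical log: two processings on a four-processor system.
jobs = [(400.0, 0.5), (200.0, 0.75)]
working_times = [250.0, 250.0, 250.0, 250.0]
print(expansion_acceleration_ratio(jobs, working_times, 500.0, 0.6))
```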

Besides, the first to fifth aspects of the invention may further
comprise the steps of: setting a performance magnification of a new
parallel computer system relative to the parallel computer system; and
calculating an estimated parallel efficiency by using the performance
magnification of the new parallel computer system and storing it into
the storage device. It becomes possible to give quantitative guidance
at a time of system replacement. The estimated parallel efficiency may
be outputted to an output device such as a display device.

Further, the first to fifth aspects of the invention may further
comprise the step of: calculating a system operational efficiency by
dividing a product sum, with respect to all processings, of a sum of
processing times of processings executed in the respective processors
presently included in the parallel computer system and a calculated
parallel efficiency, by a sum of working times of the respective
processors presently included in the parallel computer system (for
example, in accordance with an expression (17) in an embodiment) and
storing it into the storage device. As compared with a conventional idea
of a working efficiency, when the system operational efficiency in which
consideration is given to the parallel efficiency is adopted as in the
invention, it becomes possible to evaluate the system operational state
in a more practical form. The system operational efficiency may be
outputted to an output device such as a display device.
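
The system operational efficiency can be sketched as a weighted sum over
the logged processings. Each processing is represented here as a pair
(sum of per-processor processing times, parallel efficiency); this
representation, the function name, and the values are hypothetical.

```python
def system_operational_efficiency(jobs, working_times):
    """Sum over processings of (total processor time * Ep), divided by total working time."""
    effective_time = sum(total_time * e_p for total_time, e_p in jobs)
    return effective_time / sum(working_times)


# Hypothetical log for a four-processor system, each processor up for 250 s.
jobs = [(400.0, 0.5), (200.0, 0.75)]
working_times = [250.0, 250.0, 250.0, 250.0]

busy_fraction = sum(t for t, _ in jobs) / sum(working_times)   # ignores parallel efficiency
print(f"busy fraction:                 {busy_fraction:.2f}")
print(f"system operational efficiency: {system_operational_efficiency(jobs, working_times):.2f}")
```

In this example the processors are busy 60% of the time, but only 35%
of the working time is effective once the parallel efficiency of each
processing is taken into account.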

Besides, the first to fifth aspects of the invention may further
comprise the steps of: setting a target processing time; calculating
a target parallel efficiency by using the target processing time and
storing it into the storage device; and confirming propriety of the
target parallel efficiency. For example, the target parallel efficiency
can be calculated by linear extrapolation.

Further, the present invention may further comprise the steps of:
in a case where the propriety of the target parallel efficiency is
confirmed, calculating a parallel efficiency after execution of tuning
and storing it into the storage device; and comparing the parallel
efficiency after the execution of the tuning with the target parallel
efficiency. It becomes possible to execute the tuning of an application
or the like from a more quantitative viewpoint. The parallel efficiency
after execution of tuning and/or the target parallel efficiency may be
outputted to an output device such as a display device.

Besides, the first to fifth aspects of the invention may further
comprise the steps of: setting a target processing time; calculating
an estimated value of the number of required processors for each
different algorithm by using a parallel efficiency in each algorithm
and storing it into the storage device; and extracting an algorithm for
which the estimated value of the number of processors is smaller than
an acceleration ratio representing a limit of improvement in a
shortening degree of the processing time by parallelization of the
processing by the algorithm, and is the minimum among the estimated
values of the number of processors calculated for the different
algorithms. It becomes possible to quantitatively select the algorithm
which can further improve the parallel efficiency. The extracted
algorithm may be outputted to an output device such as a display device.
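
The extraction step can be sketched as a filter followed by a minimum.
How the number of required processors is estimated is not stated in this
summary; the sketch assumes the estimate p = (one-processor time) /
(parallel efficiency x target processing time), obtained by solving
Ep(p) = (one-processor time) / (p x processing time) for p with the
efficiency held fixed. That estimate, the tuple layout, and all names and
values are hypothetical.

```python
def select_algorithm(candidates, target_time):
    """Return the name of the selected algorithm, or None if none qualifies.

    candidates : list of (name, one-processor time, parallel efficiency,
                 acceleration ratio) tuples
    """
    feasible = []
    for name, tau_1, e_p, a_p in candidates:
        p_est = tau_1 / (e_p * target_time)   # assumed estimate of required processors
        if p_est < a_p:                       # must stay below the acceleration limit
            feasible.append((p_est, name))
    return min(feasible)[1] if feasible else None


# Hypothetical algorithms: A is faster on one processor, B parallelizes better.
candidates = [("A", 1000.0, 0.5, 10.0), ("B", 1200.0, 0.8, 50.0)]
print(select_algorithm(candidates, target_time=100.0))
```

In this example algorithm A would need 20 processors, which exceeds its
acceleration ratio of 10, so only algorithm B (15 processors) qualifies.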

Incidentally, the parallel efficiency calculation method of the
invention can be executed by a program and a computer, and in the case
where the program is executed by the computer, the computer functions as
a parallel efficiency calculation apparatus. Besides, such a program is
stored in a storage medium or a storage device, for example, a flexible
disk, a CD-ROM, a magneto-optical disk, a semiconductor memory, or a
hard disk. Besides, there is also a case where the program is
distributed as a digital signal through a network or the like.
Incidentally, an intermediate processing result is temporarily stored
in a memory.

BRIEF DESCRIPTION OF THE DRAWINGS

Fig. 1 is a drawing showing a state in which load balance is kept
between processors;

Fig. 2 is a drawing showing a classification example of processing
times in respective processors;

Fig. 3 is a drawing showing a state in which load balance is not
kept among processors (a case in which four processors are assigned and
a processing is performed by one of the processors);

Fig. 4 is a drawing showing an example of modeling of the relation
between Yi(p) and Yi (p);

Fig. 5 is a drawing showing a state in which CPU performance varies
and a data parallel processing is performed;

Fig. 6A is a drawing showing a processing time in the case where
a processing is performed by one processor;

Fig. 6B is a drawing showing a processing time in the case where
the processing is performed by four processors;

Fig. 7A is a drawing showing a processing time in the case where
times of a parallel processing part γ and a communication part χc are
taken into consideration;

Fig. 7B is a drawing showing a processing time in the case where
a task creation time χtc is further taken into consideration;

Fig. 8 is a drawing showing changes of parallel performance
evaluation indexes in the case where a parallel performance impediment
factor is added;

Fig. 9 is a drawing showing an example of a case where although
a load balance is kept between processors, a load balance between
respective processing times is not kept;

Fig. 10 is a drawing showing a processing time in the case where
a processing is performed by one processor;

Fig. 11 is a drawing showing an example of a case where a load
balance is not kept between processors;

Fig. 12 is a drawing showing the existence of a parallel
performance impediment factor, which becomes apparent in the case where
high parallelization is made;

Fig. 13 is a drawing showing a calculation example of parallel
performance evaluation indexes;

Fig. 14 is a drawing showing the relation between a working time
and the sum total of processing times;

Fig. 15 is a drawing showing the relation among a working time,
a processing time and a processing time with consideration given to a
parallel efficiency;

Fig. 16 is a drawing showing an example of a processing time in
the case where data parallel is executed by a distributed memory parallel
computer system;

Fig. 17 is a drawing for comparison between a parallel performance
evaluation index based on the original state CPU performance and an
estimated parallel performance evaluation index in the case where the
replacement with a system having CPU performance five times higher is
made;

Fig. 18 is a drawing showing data for a trial calculation in the
case where the replacement with a system having CPU performance five
times higher is made;

Fig. 19 is a functional block diagram of an embodiment of the
invention;

Fig. 20 is a conceptual drawing expressing confirmation and count
of the occurrence of an event by sampling;

Fig. 21 is a drawing showing an example of a sampling result at
the time of execution of a program indicated in Table 1;

Fig. 22 is a drawing showing an example of a processing flow of
a parallel performance analyzer;

Fig. 23 is a drawing showing an example of a measurement result
of processing times by time measurement;

Fig. 24 is a drawing showing an example of a measurement result
of processing times by sampling;

Fig. 25 is a drawing showing an example of a first portion of a
processing flow of a processor number optimization processing;

Fig. 26 is a drawing showing an example of a second portion of
a processing flow of a processor number optimization processing;

Fig. 27 is a drawing showing an example of a processing flow of
a processor add-on estimation processing;

Fig. 28 is a drawing showing an example of a processing flow of
a system replacement data processing;

Fig. 29 is a drawing showing a performance guidance concerning
communication in the case where CPU performance is five times higher
and a target parallel efficiency is 0.6;

Fig. 30 is a drawing showing an example of a processing flow for
a system operational efficiency improvement processing;

Fig. 31 is a drawing showing an example of a processing flow of
a tuning processing;

Fig. 32 is a drawing showing a change of parallel performance
evaluation indexes before tuning and after execution of first tuning;

Fig. 33 is a drawing showing a processing time by a parallel
processing program on the basis of an algorithm unsuited for a parallel
processing;

Fig. 34 is a drawing showing a processing time by a parallel
processing program on the basis of an algorithm suited for a parallel
processing;

Fig. 35 is a drawing for comparison of parallel performance
indexes between the algorithm unsuited for the parallel processing and
the algorithm suited for the parallel processing;

Fig. 36 is a drawing showing a processing flow of an algorithm
selection processing;

Fig. 37 is a drawing showing an example of log data of a parallel
processing system;

Fig. 38 is a drawing showing a measurement result of processing
times in a first example;

Fig. 39 is a drawing showing a calculation result of parallel
performance evaluation indexes in the first example;

Fig. 40 is a drawing showing a measurement result of processing
times in a second example;

Fig. 41 is a drawing showing a calculation result of parallel
performance evaluation indexes in the second example;

Fig. 42 is a drawing showing a measurement result of processing
times in a third example;

Fig. 43 is a drawing showing a calculation result of parallel
performance evaluation indexes in the third example;

Fig. 44 is a drawing showing a measurement result of processing
times in a fourth example;

Fig. 45 is a drawing showing a calculation result of parallel
performance evaluation indexes in the fourth example;

Fig. 46 is a drawing showing a measurement result of processing
times in a fifth example;

Fig. 47 is a drawing showing a calculation result of parallel
performance evaluation indexes in the fifth example;

Fig. 48 is a drawing showing a measurement result of processing
times in a sixth example (data parallel using redundancy processing);

Fig. 49 is a drawing showing a calculation result of parallel
performance evaluation indexes in the sixth example (data parallel using
redundancy processing);

Fig. 50 is a drawing showing a measurement result of processing
times in the sixth example (data parallel in which a portion which
cannot be processed in parallel is processed by a specific processor);

Fig. 51 is a drawing showing a calculation result of parallel
performance evaluation indexes in the sixth example (data parallel in
which the portion which cannot be processed in parallel is processed
by the specific processor);

Fig. 52 is a drawing showing a measurement result of processing
times in a seventh example;

Fig. 53 is a drawing showing a calculation result of parallel
performance evaluation indexes in the seventh example;

Fig. 54 is a drawing showing a measurement result of processing
times in an eighth example;

Fig. 55 is a drawing showing a calculation result of parallel
performance evaluation indexes in the eighth example;

Fig. 56 is a drawing showing a measurement result of processing
times in a ninth example;

Fig. 57 is a drawing showing a calculation result of parallel
performance evaluation indexes in the ninth example;

Fig. 58 is a drawing showing a measurement result of processing
times in a tenth example; and

Fig. 59 is a drawing showing a calculation result of parallel
performance evaluation indexes in the tenth example.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

[Principle of the invention] 

In the present invention, a parallel efficiency Ep(p) is
described by parallel performance evaluation indexes so that the
parallel efficiency Ep(p) is quantitatively connected to parallel
performance impediment factors. As shown in Fig. 2, a processing time
τi(p) of a processor i can be expressed, as in expression (3), as the
sum of a processing time γi(p) of a parallel calculation part and
processing times χi,j(p) of respective parallel performance impediment
factors j. Here, 1 ≤ j ≤ jOthers. Incidentally, in Fig. 2, i is a
processor number and p is the number of processors. Besides, Fig. 2
shows only a processor i and a processor i+1.

\[ \tau_i(p) = \gamma_i(p) + \sum_{j=1}^{j_{Others}} \chi_{i,j}(p) \tag{3} \]

The expression (1) is transformed as set forth below, and the
parallel efficiency Ep(p) is described by introducing, as parallel
performance evaluation indexes, a load balance contribution ratio Rb(p),
a virtual parallelization ratio Rp(p), and a parallel performance
impediment factor contribution ratio Rj(p).

Incidentally, a transformation from the expression (1) to the
expression (4-1) is made by multiplying a numerator and a denominator
of the expression (1) by the sum of τi(p) with respect to i. Besides,
a transformation to the expression (4-2) is made by changing positions
of the respective elements in the expression (4-1) and multiplying a
numerator and a denominator of the expression (4-1) by the sum of γi(p)
with respect to i. Besides, the following expression is derived from the
expression (3). This expresses the sum of τi(p) with respect to i.

\[ \sum_{i=1}^{p}\tau_i(p) = \sum_{i=1}^{p}\gamma_i(p) + \sum_{j=1}^{j_{Others}}\sum_{i=1}^{p}\chi_{i,j}(p) \]

Then, the following expression is derived using the load balance
contribution ratio Rb(p), the virtual parallelization ratio Rp(p), and
the parallel performance impediment factor contribution ratio Rj(p).

\[ E_p(p) = R_b(p)\cdot\frac{1}{R_p(p)}\cdot\Bigl(1 - \sum_{j=1}^{j_{Others}} R_j(p)\Bigr) \tag{4-4} \]

Besides, when an acceleration ratio Ap(p) is used, the expression
is also expressed as follows:

\[ E_p(p) = R_b(p)\cdot\frac{A_p(p)}{A_p(p) - 1}\cdot\Bigl(1 - \sum_{j=1}^{j_{Others}} R_j(p)\Bigr) \tag{4-5} \]

Incidentally, the load balance contribution ratio Rb(p), the
virtual parallelization ratio Rp(p), the parallel performance
impediment factor contribution ratio Rj(p), and the acceleration ratio
Ap(p) are expressed as follows:

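\[ R_b(p) = \frac{\sum_{i=1}^{p}\tau_i(p)}{\tau(p)\cdot p}, \quad \Bigl(\frac{1}{p} \le R_b(p) \le 1\Bigr) \tag{5} \]

\[ R_p(p) = \frac{\sum_{i=1}^{p}\gamma_i(p)}{\tau(1)}, \quad (0 \le R_p(p) \le 1) \tag{6-1} \]

\[ A_p(p) = \frac{1}{1 - R_p(p)} \tag{6-2} \]

\[ R_j(p) = \frac{\sum_{i=1}^{p}\chi_{i,j}(p)}{\sum_{i=1}^{p}\tau_i(p)}, \quad (0 \le R_j(p) \le 1) \tag{7} \]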

Incidentally, the parallel performance impediment factor is 
numbered j . 
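
The chain from the expression (1) to the product of the three indexes can be written out as below. This is a sketch reconstructed from the definitions (5), (6-1), (6-2) and (7), on the assumption that the expression (1) is Ep(p) = τ(1)/(τ(p)·p), as the numerical examples later in the text imply; the intermediate steps are not labeled with the original sub-numbers.

```latex
\begin{align*}
E_p(p) &= \frac{\tau(1)}{\tau(p)\cdot p}
        = \frac{\sum_{i}\tau_i(p)}{\tau(p)\cdot p}\cdot\frac{\tau(1)}{\sum_{i}\tau_i(p)} \\
       &= \underbrace{\frac{\sum_{i}\tau_i(p)}{\tau(p)\cdot p}}_{R_b(p)}
          \cdot\underbrace{\frac{\tau(1)}{\sum_{i}\gamma_i(p)}}_{1/R_p(p)}
          \cdot\frac{\sum_{i}\gamma_i(p)}{\sum_{i}\tau_i(p)} \\
       &= R_b(p)\cdot\frac{1}{R_p(p)}\cdot
          \frac{\sum_{i}\tau_i(p) - \sum_{j}\sum_{i}\chi_{i,j}(p)}{\sum_{i}\tau_i(p)}
        = R_b(p)\cdot\frac{1}{R_p(p)}\cdot\Bigl(1 - \sum_{j} R_j(p)\Bigr)
\end{align*}
```

The last step uses the sum of the expression (3) over i. Substituting Rp(p) = 1 - 1/Ap(p) from the expression (6-2) gives the same result in terms of the acceleration ratio.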

A state in which a load balance is kept is a state in which the
processing times τi(p) of the processors are uniform, as shown in Fig. 1.
The expression (5) expresses this state as Rb(p) = 1, and expresses a
state in which the load balance is not kept as 1/p ≤ Rb(p) < 1. As shown
in Fig. 3, in the case where a processing is performed by only one
processor at the time of a parallel processing, since the numerator of
the expression (5) becomes τ(p), the load balance contribution ratio
Rb(p) becomes a lower limit value of 1/p. Besides, according to the
expression (5), the load balance contribution ratio Rb(p) becomes a
multiplicative factor of the parallel efficiency Ep(p), and this
facilitates an intuitive grasp of the parallel performance.

The virtual parallelization ratio Rp(p) is a ratio of the sum of
the processing times γi(p) of the parallel calculation parts to τ(1). In
the case where this is less than 1, this indicates that the processing
includes a processing which cannot be processed in parallel. By this
ratio, the upper limit of the parallel performance can be expressed as
the acceleration ratio Ap(p). The acceleration ratio Ap(p) is an ideal
upper limit value Ap(p) = τ(1)/Σj χ1,j(1) = 1/(1 - (Σi γi(p)/τ(1))) at the
time when processors are infinitely applied. A normal value of τ(1)/τ(p)
becomes a value smaller than Ap(p) because of the parallel performance
impediment factors.

Since the parallel performance impediment factor contribution
ratio Rj(p) is normalized by the sum of τi(p) with respect to i as
expressed by the expression (7), irrespective of high parallelization
or low parallelization, the contribution of the parallel performance
impediment factor can be grasped as a ratio of the processing time.
Besides, since this ratio enters the parallel efficiency as a
proportional factor, an impediment to the parallel performance can be
quantitatively grasped.
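
As an illustration of how the indexes fit together, the following minimal sketch computes Rb(p), Rp(p), Rj(p) and Ep(p) from per-processor times and checks that the product form agrees with τ(1)/(τ(p)·p). The function name, the data layout and all numeric values are assumptions made for this example only; they are not taken from the figures.

```python
def evaluation_indexes(gamma, chi, tau_1):
    """gamma[i]: parallel-part time of processor i.
    chi[j][i]: time of impediment factor j on processor i.
    tau_1: (estimated) processing time on one processor."""
    p = len(gamma)
    # Expression (3): total time of each processor.
    tau = [gamma[i] + sum(factor[i] for factor in chi) for i in range(p)]
    tau_p = max(tau)                      # time of the slowest processor
    total = sum(tau)
    r_b = total / (tau_p * p)             # expression (5)
    r_p = sum(gamma) / tau_1              # expression (6-1)
    r_j = [sum(factor) / total for factor in chi]  # expression (7)
    e_p = r_b / r_p * (1.0 - sum(r_j))    # product of the indexes
    return {"Rb": r_b, "Rp": r_p, "Rj": r_j, "Ep": e_p, "tau_p": tau_p}


# Hypothetical measurements for p = 4 (illustrative values only).
gamma = [40.0, 42.0, 38.0, 41.0]
chi = [
    [5.0, 5.0, 5.0, 5.0],   # j = 1: redundancy processing
    [8.0, 6.0, 9.0, 7.0],   # j = 2: communication
]
tau_1 = sum(gamma) + 5.0    # assumed estimate of tau(1)

idx = evaluation_indexes(gamma, chi, tau_1)
direct = tau_1 / (idx["tau_p"] * len(gamma))   # tau(1) / (tau(p) * p)
```

Here `idx["Ep"]` and `direct` agree up to floating-point rounding, which is the point of expressing the efficiency through the three ratios.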

All variables of the expressions (2), (5), (6-1) and (7) except
τ(1) can be measured at the time of parallel execution. When expression
(8-1) is established, the virtual parallelization ratio Rp(p) becomes
substantially 1 from the expression (6-1), and the expression (4-4)
becomes equal to expression (8-2).



\[ \tau(1) = \sum_{i=1}^{p}\gamma_i(p) \tag{8-1} \]

\[ E_p(p) = R_b(p)\cdot\Bigl(1 - \sum_{j=1}^{j_{Others}} R_j(p)\Bigr) \tag{8-2} \]



That is, since the calculation of the parallel efficiency Ep(p)
does not require the estimated value of τ(1), it can be accurately
determined. On the other hand, irrespective of the condition expression
(8-1), it is also possible to use the expression (8-2) as a substitute
value for the expression (4-4) or (4-5) in the parallel performance
evaluation. In this case, the value of the expression (8-2) becomes a
value equal to or less than the value of the expression (4-4) or (4-5)
because Rp(p) ≤ 1.

The parallel efficiency Ep(p) can also be calculated by the above
described expressions (4-4), (4-5) and (8-2) and the following expression
(9-1).

\[ E_p(p) = R_b(p)\cdot\frac{1}{R_p(p)}\cdot\frac{\sum_{i=1}^{p}\gamma_i(p)}{\sum_{i=1}^{p}\tau_i(p)} \tag{9-1} \]

The expression (9-1) is a result of transformation of the
expression (4-3) by using only the load balance contribution ratio Rb(p)
and the virtual parallelization ratio Rp(p).

Besides, τ(1) becomes expression (10) from the expression (3).

\[ \tau_1(1) = \gamma_1(1) + \sum_{j=1}^{j_{Others}} \chi_{1,j}(1) \tag{10} \]

Here, γ1(1) and χ1,j(1) are modeled using γi(p) and χi,j(p). Since
it is impossible to make an actual measurement of τ1(1) in a parallel
processing by computers having different CPU performances as in the
grid or the cluster, τ(1) can be determined by this modeling, and it
becomes possible to calculate the virtual parallelization ratio Rp(p)
of the expression (6-1).

In an ideal case where processor performances are identical, when
a parallel calculation part is processed by p processors, a processing
time becomes 1/p as compared with p = 1. In this case, τi(p) = γi(p),
and when γi(p) of an arbitrary processor is multiplied by a factor of
p, γ1(1) can be obtained. On the other hand, since computers having
different CPU performances generally exist on the grid or the cluster,
γ1(1) in the case where the number of processors is 1 is presumed, on
the basis of γi(p) actually measured in the p processors, as in
expression (11).

\[ \gamma_1(1) = \sum_{i=1}^{p}\gamma_i(p) \tag{11} \]

Fig. 4 is a conceptual view of this expression (11). By the modeling
of the expression (11), even in the case where performances of the
respective processors are different from each other, the time γ1(1)
of one processor can be virtually determined.

Besides, χ1,j(1) is divided into two parts, a redundancy
processing and the others, for modeling. It is assumed that all of the
processing times not belonging to γ1(1) are included in χ1,j(1).

(1) Modeling of redundancy processing time 

When the respective processors perform the same processing, it
is called a redundancy processing here. This processing is not a parallel
processing, and even if the number of processors is increased, its
processing time is not decreased. Then, j = 1 is assigned to the
redundancy processing, and its time χ1,1(1) is modeled as in expressions
(12-1) to (12-4).



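\[ \chi_{1,1}(1) = \frac{1}{p}\sum_{i=1}^{p}\chi_{i,1}(p) \tag{12-1} \]

\[ \chi_{1,1}(1) = \max_{1 \le i \le p}\bigl(\chi_{i,1}(p)\bigr) \tag{12-2} \]

\[ \chi_{1,1}(1) = \min_{1 \le i \le p}\bigl(\chi_{i,1}(p)\bigr) \tag{12-3} \]

\[ \chi_{1,1}(1) = \chi_{ii,1}(p) \tag{12-4} \]

Where ii is the processor number that satisfies the following expression.

\[ \tau_{ii}(p) = \max_{1 \le i \le p}\Bigl(\gamma_i(p) + \sum_{j=1}^{j_{Others}}\chi_{i,j}(p)\Bigr) \]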

The redundancy processing is a processing often performed in
so-called data parallel in which processings having the same procedure
(processing content) and different data are processed in parallel. In
the case of the data parallel, in order to keep a load balance, the
processing is assumed to be a parallel processing performed by
processors having the same CPU performance. Then, it is possible to
consider that a difference in redundancy processing time between the
respective processors is due to a fluctuation in the time measurements
of the respective processors. In this case, it is proper to apply the
expression (12-1), in which measurement values of the respective
processors are averaged.

On the other hand, it is supposed that a processor having a
different performance gets mixed in the grid or the cluster. In the case
where an attempt to accurately grasp the influence of the processor
having the different performance is made, the expressions (12-2) and
(12-3) are used. When the expression (12-2) is used, the virtual
parallelization ratio Rp(p) of the expression (6-1) is estimated to be
a minimum, and the parallel efficiency Ep(p) becomes maximum. When the
expression (12-3) is used, the virtual parallelization ratio Rp(p) is
estimated to be a maximum, and the parallel efficiency Ep(p) becomes
minimum. When these two parallel efficiencies Ep(p) are compared, it
becomes possible to detect that the data parallel processing is
performed by processors having different CPU performances.

The processing time τ(p) is determined by the expression (2), and
the redundancy processing time of the processor ii that determines τ(p)
is the value of the expression (12-4). Accordingly, when consideration
is given to the analysis of the data from which the parallel efficiency
Ep(p) has been determined, it is proper to use the expression (12-4).
On the other hand, this expression means that χ1,1(1) is determined by
only information of the processor ii, and has a defect that a case where
this value greatly differs from the value of another processor cannot
be detected. As an example of this, Fig. 5 shows a case where a
processor 1 (i = 1) has a CPU performance of 1/5. In the expression
(12-4), a time of a redundancy processing is evaluated by only a value
of the processor 1. In the expression (12-1), an evaluation is made by
an average of respective values of four processors. In the expression
(12-2), an evaluation is made by a value of the processor 1, and in the
expression (12-3), an evaluation is made by values of the processors
of i = 2, 3 and 4. Accordingly, it is proper that the expression (12-1)
is basically used, and another definition is used as the need arises.
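
The four definitions can be compared side by side with a short sketch. The function name, the dictionary keys and the measured values below are hypothetical; the values merely imitate the situation described for Fig. 5, in which one of four processors has 1/5 of the CPU performance.

```python
def redundancy_time_estimates(chi_red, tau):
    """Estimate the one-processor redundancy time from per-processor
    redundancy times chi_red[i] and total times tau[i]."""
    p = len(chi_red)
    # Processor ii: the one whose total time determines tau(p).
    critical = max(range(p), key=lambda i: tau[i])
    return {
        "mean": sum(chi_red) / p,        # expression (12-1)
        "max": max(chi_red),             # expression (12-2)
        "min": min(chi_red),             # expression (12-3)
        "critical": chi_red[critical],   # expression (12-4)
    }


# Hypothetical data: processor 1 is five times slower than the others.
chi_red = [50.0, 10.0, 10.0, 10.0]
tau = [150.0, 110.0, 110.0, 110.0]
est = redundancy_time_estimates(chi_red, tau)
```

A large gap between the "max" and "min" estimates is the signal, mentioned in the text, that processors of different performance are mixed.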

(2) Modeling of χ1,j(1) (2 ≤ j < jOthers)

When a processing time due to a parallel performance impediment
factor is actually measured, there is a case of χ1,j(1) ≠ 0. Since this
processing time is not decreased even if a parallel processing is
performed, it is reflected in the virtual parallelization ratio Rp(p)
of the expression (6-1), the acceleration ratio Ap of the expression
(6-2) becomes a finite value, and an upper limit value of the number
of processors that can be meaningfully applied to the processing is
determined. Then, a processing time χ1,j(1) (2 ≤ j < jOthers) due
to a parallel performance impediment factor other than the redundancy
processing is modeled into the expression (13-1) by the processing time
χi,j(1) at p = 1 and the expression (13-2) expressing the processing time
at p > 1, and the measured total Xi,j(p) and the processing time χi,j(p)
due to a parallel performance impediment factor occurring at p > 1 and
dependent on p are measured to obtain χi,j(1). That is,
χi,j(1) = Xi,j(p) - χi,j(p), and both of the two terms of the right side
are obtained by measurement.



As an example, Fig. 6A shows processing times at p = 1 in the case
of χ1,j(1) ≠ 0 at p = 1, and Fig. 6B shows processing times at p = 4.
As shown in Fig. 6B, Xi,2(p) at the time of the parallel processing
becomes as expressed by the expression (13-1), in which χi,2(p) is added
to the processing time χi,2(1) at p = 1. Such a phenomenon is observed
in a case where a preprocessing executed until communication hardware
is activated in communication or the like is executed also at p = 1.

From the expressions (13-1) and (13-2), similarly to the
redundancy processing, χ1,j(1) (2 ≤ j < jOthers) can be obtained by
expressions (13-3), (13-4) and (13-5).




(13-1) 




(13-2) 



(13-3) 

(13-4) 
(13-5) 

For example, in Fig. 6B, since Xi,2(p) = χi,2(1) + χi,2(p) is obtained
by an actual measurement, Xi,2(p) = 5, 6, 7, 8 are actually measured,
and χ1,2(1) = 5 can be calculated from the expressions (13-3), (13-4)
and (13-5). This value is coincident with χ1,2(1) of Fig. 6A.

(3) Determination method of χi,jOthers(p)

The processing time χi,jOthers(p), which cannot be actually measured
in the classification of parallel performance impediment factors, is
obtained by expression (13-6).

\[ \chi_{i,j_{Others}}(p) = \tau_i(p) - \gamma_i(p) - \sum_{j=1}^{j_{Others}-1}\chi_{i,j}(p) \tag{13-6} \]
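
The residual character of the "others" time can be sketched as follows: whatever part of a processor's total time is not attributed to the parallel calculation part or to a measured factor is assigned to the last factor. The function name and the sample values are hypothetical.

```python
def others_time(tau_i, gamma_i, measured_chi_i):
    """Residual time of processor i not covered by the parallel part
    or by any individually measured impediment factor."""
    residual = tau_i - gamma_i - sum(measured_chi_i)
    if residual < 0:
        # Measured parts exceed the total: the measurements are inconsistent.
        raise ValueError("measured parts exceed the total processing time")
    return residual


# Hypothetical processor: 120 total, 90 parallel, 10 redundancy, 15 communication.
others = others_time(120.0, 90.0, [10.0, 15.0])
```

The explicit check for a negative residual is a design choice of this sketch, not something the text prescribes.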

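Next, from the expression (11) obtained by the modeling, the
expression (10) is transformed into the following expression.

\[ \tau(1) = \sum_{i=1}^{p}\gamma_i(p) + \sum_{j=1}^{j_{Others}}\chi_{1,j}(1) \]

Besides, the expression (8-1) is transformed as follows:

\[ \sum_{i=1}^{p}\gamma_i(p) + \sum_{j=1}^{j_{Others}}\chi_{1,j}(1) = \sum_{i=1}^{p}\gamma_i(p) \]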

It is necessary to satisfy the condition of expression (14) in
order to establish this expression. Since this conditional expression
is a comparison between magnitudes of values calculated from measurement
values, a judgment can be specifically made.

\[ \sum_{i=1}^{p}\gamma_i(p) \gg \sum_{j=1}^{j_{Others}}\chi_{1,j}(1) \tag{14} \]



The processing time γi(p) of this expression is obtained by an
actual measurement. Besides, the sum of χ1,j(1) with respect to j can
be calculated from the modeling expressions (12-1), (12-2), (12-3), and
(12-4) and the expressions (13-1), (13-2), (13-3), (13-4) and (13-5).
As a result, the judgment of the expression (8-1) is specifically
enabled for the first time by using the expression (14). In the case
where the condition of the expression (14) is established, the virtual
parallelization ratio Rp(p) of the expression (6-1) becomes
approximately 1, the expressions (4-4) and (4-5) become equal to the
expression (8-2), and an accurate parallel efficiency Ep(p) can be
obtained in the sense that the influence of τ(1) as an estimated value
becomes approximately zero.
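
A minimal sketch of this judgment is given below. The text only requires the parallel part to be much larger than the modeled one-processor impediment time; the factor of 100 used as the threshold, the function names and the sample values are assumptions of this example.

```python
def tau_1_estimate(gamma, chi_1):
    """One-processor time modeled as the sum of the measured parallel
    parts plus the modeled impediment times at p = 1."""
    return sum(gamma) + sum(chi_1)


def condition_holds(gamma, chi_1, factor=100.0):
    """True when the parallel part dominates by the assumed factor."""
    return sum(gamma) >= factor * sum(chi_1)


# Hypothetical measurements for p = 4.
gamma = [250.0, 240.0, 260.0, 250.0]   # parallel calculation parts
chi_1 = [2.0, 1.0]                     # modeled chi_{1,j}(1) for two factors

holds = condition_holds(gamma, chi_1)
r_p = sum(gamma) / tau_1_estimate(gamma, chi_1)   # close to 1 when the condition holds
```

With these values Rp(p) is about 0.997, so replacing it by 1 changes the computed efficiency by roughly 0.3%.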



Besides, τ(1) also becomes expression (15), which can be
specifically calculated from the expressions (10), (11), (12-1), (12-2),
(12-3), (12-4), (13-1), (13-2), (13-3), (13-4), (13-5) and (13-6). The
expression (15) expresses τ(1) as the sum of the total sum of parallel
processing times γi(p) of the respective processors and the sum of
processing times χ1,j(1) of the parallel efficiency impediment
factors at p = 1.

\[ \tau(1) = \sum_{i=1}^{p}\gamma_i(p) + \sum_{j=1}^{j_{Others}}\chi_{1,j}(1) \tag{15} \]





From the above-described expressions (1), (2) and (15), the following
expression (9-2) is obtained.

\[ E_p(p) = \frac{\sum_{i=1}^{p}\gamma_i(p) + \sum_{j=1}^{j_{Others}}\chi_{1,j}(1)}{\tau(p)\cdot p} \tag{9-2} \]

The expressions (9-1) and (9-2) use the sum of the times γi(p)
of the parallel calculation part with respect to i, and as compared with
the expressions (4-4) and (4-5), there is a merit that the parallel
efficiency Ep(p) is calculated without data of χi,j(p). However, data of
χ1,j(1) is necessary.

As indicated by the expressions (4-4), (4-5) and (7), in the
invention, an arbitrary number of parallel performance impediment
factors j can be added. Figs. 7A and 7B show examples of adding the
parallel performance impediment factor. Fig. 7B shows a case where a
time measurement is made in view of a task creation time χi,tc, and Fig.
7A shows a processing time in the case where a time measurement of the
task creation time χi,tc is not made in the same processing. In the case
of Fig. 7A, the following calculation is carried out.

τ1 = 10 + 5 + 90 + 20 + 20 = 145

τ2 = 10 + 80 + 10 = 100

τ3 = 15 + 80 + 10 = 105

τ4 = 10 + 90 + 10 = 110

Rb(4) = (145 + 100 + 105 + 110)/(145 × 4) = 0.7931

Rc(4) = (25 + 20 + 25 + 20)/460 = 0.1957

Rp(4) = 1

Ep(4) = 0.7931 × 1 × (1 - 0.1957) = 0.6379

Besides, in the case of Fig. 7B, the following calculation is
carried out.

τ1 = 10 + 5 + 90 + 20 + 20 = 145

τ2 = 5 + 10 + 80 + 10 = 105

τ3 = 10 + 15 + 80 + 10 = 115

τ4 = 15 + 10 + 90 + 10 = 125

Rb(4) = (145 + 105 + 115 + 125)/(145 × 4) = 0.8448

Rc(4) = (25 + 20 + 25 + 20)/490 = 0.1837

Rtc(4) = (0 + 5 + 10 + 15)/490 = 0.0612

Rp(4) = 1

Ep(4) = 0.8448 × 1 × (1 - 0.1837 - 0.0612) = 0.6379

Fig. 8 shows values of these parallel performance evaluation
indexes together. As compared with the values (case 1) calculated from
Fig. 7A, it is understood that with respect to the values (case 2)
calculated from Fig. 7B, although Rc is decreased and Rb is increased
by the addition of Rtc for the task creation time, Ep is the same. In
this case, Ep is not changed by adding the parallel performance
impediment factor, but the details become clearer.
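
The two calculations can be reproduced with a few lines. The sketch below uses the per-processor totals and the factor times that appear in the calculations for Figs. 7A and 7B; the function name and the dictionary keys are hypothetical, and Rp(4) = 1 is taken from the text.

```python
def parallel_efficiency(tau, factors, r_p=1.0):
    """tau[i]: total time of processor i.
    factors[name][i]: time of the named impediment factor on processor i."""
    p = len(tau)
    total = sum(tau)
    r_b = total / (max(tau) * p)
    r_j = {name: sum(times) / total for name, times in factors.items()}
    e_p = r_b / r_p * (1.0 - sum(r_j.values()))
    return r_b, r_j, e_p


comm = [25, 20, 25, 20]   # communication time per processor

# Case 1 (Fig. 7A): task creation time is not measured separately.
case1 = parallel_efficiency([145, 100, 105, 110], {"c": comm})

# Case 2 (Fig. 7B): task creation time is measured as its own factor.
case2 = parallel_efficiency(
    [145, 105, 115, 125], {"c": comm, "tc": [0, 5, 10, 15]}
)
```

Both cases yield the same Ep(4) because the newly measured time is added to the totals and subtracted again through Rtc; only the breakdown changes.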

By expressing the load balance contribution ratio Rb(p) as
indicated by the expression (5), the load balance and the parallel
efficiency Ep(p) can be correlated with each other. The reason why the
load balance contribution ratio Rb(p) is defined as the expression (5)
is that it is possible to consider a case where, as shown in Fig. 9, the
load balance is kept in a state where the contribution of the parallel
performance impediment factors varies among the respective processors.
In Fig. 9, for example, a parallel processing portion of a processor 1
is very small as compared with the others, and a redundancy processing
is very large. However, since the processing times of all the processors
are coincident with each other, the load balance is kept. That is, the
state is such that although γi(p) and χi,j(p) are not individually
balanced, they are balanced in total. Incidentally, χi,jOthers(p), which
is what remains of τi after γi and the measured χi,j are subtracted, is
a processing time due to, for example, I/O.

In the case of Fig. 9, the load balance contribution ratio Rb(p)
is 1. As shown in Fig. 10, in the case where a processing is performed
by only one processor in a parallel processing, Rb(p) becomes a lower
limit 1/p. Besides, as shown in Fig. 11, although processing times of
the processor 1 and the processor 2 are coincident with each other, they
are not coincident with the processing times of the processors 3 and
4, and the load balance is not kept. In this case, Rb(p) becomes as
follows:

\[ R_b(p) = \frac{\sum_{i=1}^{p}\tau_i(p)}{\tau(p)\cdot p} = \frac{100 + 100 + 80 + 70}{100 \times 4} = 0.875 \]
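
The three situations can be checked with a one-line function. The times for the Fig. 11 case are those of the calculation above; the times used for the fully balanced case and for the single-processor case are hypothetical stand-ins, since the values of Figs. 9 and 10 are not given in the text.

```python
def load_balance_ratio(tau):
    """Sum of per-processor times divided by (slowest time * processor count)."""
    return sum(tau) / (max(tau) * len(tau))


balanced = load_balance_ratio([100, 100, 100, 100])   # all processors equal
single = load_balance_ratio([100, 0, 0, 0])           # only one processor works
fig11 = load_balance_ratio([100, 100, 80, 70])        # times quoted for Fig. 11
```

The results are 1, 1/p = 0.25 and 0.875 respectively, matching the upper limit, the lower limit and the intermediate case.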

Further, there is a case where a parallel performance impediment
factor, which does not become apparent in a low parallel, becomes
apparent in a high parallel. In the parallelization ratio (= (processing
time of parallel processing part at p = 1)/((processing time of parallel
processing part at p = 1) + (processing time of portion which cannot
be processed in parallel at p = 1))), which is one of conventional
performance evaluation indexes, this phenomenon cannot be sufficiently
grasped. For example, in an example of Fig. 12, the parallelization ratio
at p = 1 is 0.99 (= 198/(198 + 2)), and the remainder of 0.01 is a ratio
of a processing time which cannot be processed in parallel. However,
this portion still takes two hours even in a high parallel such as the
case of p = 100 in the drawing, and the ratio of 0.01 does not reflect
the reality that the portion which cannot be processed in parallel
occupies 50% (= 2/(1.98 + 2)). In the invention, as shown in the
expression (7), the parallel performance impediment factor contribution
ratio Rj(p) is expressed as a value obtained by normalizing the sum of
χi,j(p) with respect to i using the sum of τi(p) with respect to i. By
this normalization, also when τ(p) becomes a small value in a high
parallel, an upper limit of Rj(p) becomes 1, and influences of the
respective parallel performance impediment factors can be expressed by
a percentage at the time of the parallel processing.
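
The contrast between the conventional ratio and the share actually observed at p = 100 can be worked through numerically. The sketch assumes, as the Fig. 12 example does, that the parallel part scales ideally with the number of processors; variable names are hypothetical.

```python
parallel_at_1 = 198.0   # hours of parallelizable work at p = 1
serial = 2.0            # hours that cannot be processed in parallel
p = 100

# Conventional parallelization ratio, evaluated once at p = 1.
conventional_ratio = parallel_at_1 / (parallel_at_1 + serial)

# At p = 100 the parallel part shrinks (ideal scaling assumed),
# while the serial part stays at two hours.
parallel_at_p = parallel_at_1 / p
serial_share_at_p = serial / (parallel_at_p + serial)
```

The conventional ratio stays at 0.99, while the serial portion accounts for about half of the elapsed time at p = 100, which is what a ratio normalized at the time of the parallel processing makes visible.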
As described above, the parallel efficiency Ep(p) is calculated,
and parallel performance evaluation indexes Rb(p), Rp(p), Rj(p) (Rred(4),
Rc(4), ..., Rothers(4)), and auxiliary indexes Ap(p) and Ep(p)·p can also
be calculated. Fig. 13 shows an example of this calculation result. The
parallel performance can be quantitatively expressed by the eight items
shown in Fig. 13.

As shown in Fig. 13, since Ep(p)·p = 1.777, it is understood that
although the parallel computer system has a four-processor
configuration, the processing is performed with a performance of 1.777
processors. The parallel efficiency is lowered to 94% (Rb(4) = 0.9392)
by the load balance contribution ratio. The influence of the parallel
performance impediment factors is a redundancy processing of 22%
(Rred(4) = 0.2230), a communication of 33% (Rc(4) = 0.3309), and the
others of 3% (Rothers(4) = 0.0288). Accordingly, the communication and
the redundancy processing lower the parallel performance by 55%. In
Fig. 13, since Rp(4) = 0.8821, it is possible to estimate that the
parallel maximum performance at the time when processors are infinitely
applied is 8.482 (= Ap(4) = 1/(1 - 0.8821)) times higher than that of
one processor. Accordingly, it is understood that this processing is a
processing to be performed with 8 processors or less.

Besides, in the case where a setting target value (Ep)T of the
parallel efficiency is set to 0.8, when a processing group as shown in
Fig. 13 is supposed, the optimum number of processors is calculated by
the following expression.

(p)OPT = Ep(4)/(Ep)T × p
       = 0.4443/0.8 × 4 = 2.2215

Accordingly, an estimated value of the optimum number of
processors becomes (p)OPT = 2.
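
The figures quoted for Fig. 13 can be cross-checked from the index values alone. In the sketch below the index values are those stated in the text; rounding the optimum processor count down to an integer is an assumption of this example, chosen because it reproduces the stated estimate of 2.

```python
import math

p = 4
r_b, r_p = 0.9392, 0.8821
r_j = {"red": 0.2230, "c": 0.3309, "others": 0.0288}

e_p = r_b / r_p * (1.0 - sum(r_j.values()))   # parallel efficiency
effective_processors = e_p * p                # Ep(p) * p
a_p = 1.0 / (1.0 - r_p)                       # acceleration ratio

target = 0.8                                  # target parallel efficiency
p_opt = e_p / target * p                      # real-valued estimate
p_opt_int = math.floor(p_opt)                 # assumed rounding rule
```

The values come out as Ep(4) ≈ 0.4443, Ep(4)·4 ≈ 1.777 and Ap(4) ≈ 8.482, consistent with the eight items described for Fig. 13.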

Incidentally, the processing group means plural processings in
which the same function is used in the same application program and only
input data is changed, and is a processing frequently executed in
parametric studies of scientific calculation and so on.

Conventionally, the evaluation of a parallel computer system has
been performed by a working rate (Net Working Rate) NWRSystem indicated
by the following expression (16). However, since there is also a case
where a processing having a low parallel efficiency is included, even if
the working rate is good, the operational efficiency of the system is not
necessarily high.



\[ NWR_{System} = \frac{\sum_{k=1}^{k_{max}}\sum_{i=1}^{p}\tau_i^{k}(p)}{\sum_{i=1}^{P_{System}} T_i} \tag{16} \]



Fig. 14 shows an example of a working time and the total sum
(following expression) of processing times.

\[ \sum_{k=1}^{k_{max}}\sum_{i=1}^{p}\tau_i^{k}(p) \]



In Fig. 14, the total sum of the processing times is decreased
relative to the working time Ti. The degree of the decrease varies for
processors.

According to the invention, the index of the system operational
efficiency ESystem is introduced on the basis of the expression (16), and
it becomes possible to evaluate the operational efficiency of the system.
It also becomes possible to give a specific guideline for improvement of
the operational efficiency; for example, it is possible to give a
guideline as to which processing must have its parallel efficiency
improved, and to what extent, in order to improve the index.



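\[ E_{System} = \frac{\sum_{k=1}^{k_{max}}\Bigl(E_p^{k}(p)\cdot\sum_{i=1}^{p}\tau_i^{k}(p)\Bigr)}{\sum_{i=1}^{P_{System}} T_i} \tag{17} \]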



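For example, if PSystem = 4, Ti = 10, kmax = 2, and the following
conditions are further satisfied, then ESystem = (5 + 9)/(10 + 10 + 10 +
10) = 0.35.

\[ E_p^{1}(p)\cdot\sum_{i=1}^{p}\tau_i^{1}(p) = 0.5 \times (4 + 3 + 2 + 1) = 5 \]

\[ E_p^{2}(p)\cdot\sum_{i=1}^{p}\tau_i^{2}(p) = 1 \times (9) = 9 \]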



Incidentally, a conventional working rate becomes NWRSystem = (10
+ 9)/40 = 0.475. By considering the parallel efficiency, it is possible
to estimate the operational efficiency of the system, in which
consideration is given to a time wasted in each processing by the
parallel processing.
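
The difference between the working rate and the operational efficiency can be seen by computing both from the same data. The sketch uses the numbers of the example above (four processors with a working time of 10 each, one parallel processing with efficiency 0.5 and one non-parallel processing with efficiency 1); the data layout and names are hypothetical.

```python
working_time = [10, 10, 10, 10]            # T_i of each processor

# Each processing: its parallel efficiency and per-processor times.
jobs = [
    {"Ep": 0.5, "tau": [4, 3, 2, 1]},      # processing 1, parallel
    {"Ep": 1.0, "tau": [9]},               # processing 2, single processor
]

total_working = sum(working_time)

# Working rate: processing time over working time, efficiency ignored.
nwr = sum(sum(job["tau"]) for job in jobs) / total_working          # 19/40

# Operational efficiency: each processing weighted by its efficiency.
e_system = sum(job["Ep"] * sum(job["tau"]) for job in jobs) / total_working
```

The working rate counts all 19 units of processing time, while the operational efficiency credits only the 14 units that were used effectively.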

Fig. 15 shows the above-described working time, the sum of τi(p)
in processing 1 as the parallel processing, the product of τi(p) in the
processing 1 and the parallel efficiency, the sum of τi(p) in processing
2 (only running on the processor 4) as a non-parallel processing, and
the product of τi(p) in the processing 2 and the parallel efficiency.
In Fig. 15, in the case of the parallel processing, τi(p) is shorter than
the working time Ti, and when the parallel efficiency is further
considered, it becomes still shorter since a wasteful processing time
is removed. On the other hand, in the case of the non-parallel processing,
since the parallel efficiency becomes 1, both τi(p) in the processing
2 and the product of τi(p) and the parallel efficiency become the same
value.

As base data of processor add-on of a parallel computer system,
a working rate of a system has been conventionally used. However, since
the effectively used resources of the system are not the basis, there is
a possibility that resource add-on or replacement is made for a
processing having a low parallel efficiency. According to the invention,
it becomes possible to give a quantitative guideline to the processor
add-on of the parallel computer system. When the number of all processors
in the system is PSystem, Ti is a working time of each processor, PAdd is
the number of processors after add-on, kmax is the number of all
processings, and α is a predicted parallel efficiency, an acceleration
ratio ASystem at the time when an increase of the working time, that is,
the sum of Ti over i = PSystem + 1 to PAdd, is made by additional
processors becomes as shown in expression (18).

\[ A_{System} = \frac{\sum_{k=1}^{k_{max}}\Bigl(E_p^{k}(p)\cdot\sum_{i=1}^{p}\tau_i^{k}(p)\Bigr) + \alpha\cdot\sum_{i=P_{System}+1}^{P_{Add}} T_i}{\sum_{i=1}^{P_{System}} T_i} \tag{18} \]



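For example, if α = 1 under the conditions indicated below, then
the acceleration ratio ASystem is calculated as follows:

\[ \sum_{i=1}^{P_{System}} T_i = 40, \quad \sum_{i=P_{System}+1}^{P_{Add}} T_i = 10, \quad \sum_{k=1}^{k_{max}}\Bigl(E_p^{k}(p)\cdot\sum_{i=1}^{p}\tau_i^{k}(p)\Bigr) = 39 \]

ASystem = (39 + 1 × 10)/40 = 1.23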

As stated above, system expansion of about 23% is obtained. This
value is more persuasive than a conventional working rate in relation
to the expansion, from the viewpoint that the parallel efficiency
of the processing before the expansion is considered. When the
acceleration ratio is calculated by multiplying a system expansion by
a predicted parallel efficiency α, a more realistic value is obtained.
Besides, when the CPU capability of the expanded processors is made ten
times higher, ASystem can also be obtained under α = 10. In the above
example, ASystem = (39 + 10 × 10)/40 = 3.48. By this, also with respect
to the expansion of processors having different CPU performances, it
becomes possible to prepare predicted data with a basis more reliable
than that based on the working rate.
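
The two values quoted for the add-on estimate follow from one small function. The argument names are hypothetical; the numbers (39 for the efficiency-weighted processing time, 10 for the added working time, 40 for the existing working time) are those of the example.

```python
def system_acceleration(effective_time, added_working_time,
                        base_working_time, alpha):
    """Efficiency-weighted time of the existing system plus the added
    working time scaled by alpha, relative to the existing working time."""
    return (effective_time + alpha * added_working_time) / base_working_time


a_same_cpu = system_acceleration(39, 10, 40, alpha=1)     # 49/40
a_fast_cpu = system_acceleration(39, 10, 40, alpha=10)    # 139/40
```

The results are 1.225 and 3.475, which the text reports rounded as 1.23 and 3.48.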

Besides, according to the invention, it also becomes possible to
give a quantitative guideline to a replacement of a parallel computer
system. By the indexes (parallel efficiency, load balance contribution
ratio, virtual parallelization ratio, parallel performance impediment
factor contribution ratio, τi(p), γi(p), χi,j(p), and working time Ti of
each processor) calculated for each processing, it becomes possible to
predict a parallel efficiency for each processing as indicated in the
following example, and it becomes possible to estimate performance of
the system after the system replacement.

For example, when consideration is given to the introduction of
a system having CPU performance five times higher than that of a system
in which elapsed times as shown in Fig. 16 are measured, γi(p), χi,Red(p),
and χi,Others(p) become 1/5. On the other hand, when it is assumed that
χi,c(p) depends on network performance and the performance is the same
this time, a parallel efficiency of a new system can be estimated as
follows.

Incidentally, if χ1,c(1) = 0 and χ1,Others(1) = 0, then the
performance evaluation indexes in the case where the CPU performance
becomes five times higher can be calculated as follows. Besides, by
the above-described expression (12-1), χ1,Red(1) can be expressed as
follows. Besides, the load balance contribution ratio is calculated in
accordance with the expression (5), the virtual parallelization ratio
is calculated in accordance with the expression (6-1), the parallel
performance impediment factor contribution ratio (redundancy
processing, communication, others) is calculated in accordance with the
expression (7), and the parallel efficiency is calculated in accordance
with the expressions (4-4) and (9-1) as follows.



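\[ \chi_{1,Red}(1) = \frac{1}{p}\sum_{i=1}^{p}\chi_{i,Red}(p) = \frac{1}{4}\cdot\frac{8 + 9 + 7 + 7}{5} = 1.550 \]

\[ R_b(p) = \frac{\sum_{i=1}^{p}\tau_i(p)}{\tau(p)\cdot p} = \frac{(15+8+1)/5 + 10 + (14+9+1)/5 + 11 + (13+7+1)/5 + 12 + (16+7+1)/5 + 13}{((16+7+1)/5 + 13)\cdot 4} = \frac{14.8 + 15.8 + 16.2 + 17.8}{17.8 \cdot 4} = 0.9073 \]

\[ R_p(p) = \frac{\sum_{i=1}^{p}\gamma_i(p)}{\tau(1)} = \frac{(15+14+13+16)/5}{(15+14+13+16)/5 + 7.75/5} = \frac{58/5}{(58 + 7.75)/5} = 0.8821 \]

\[ R_{Red}(p) = \frac{\sum_{i=1}^{p}\chi_{i,Red}(p)}{\sum_{i=1}^{p}\tau_i(p)} = \frac{(8 + 9 + 7 + 7)/5}{64.6} = 0.0960 \]

\[ R_c(p) = \frac{\sum_{i=1}^{p}\chi_{i,c}(p)}{\sum_{i=1}^{p}\tau_i(p)} = \frac{10 + 11 + 12 + 13}{64.6} = 0.7121 \]

\[ R_{Others}(p) = \frac{\sum_{i=1}^{p}\chi_{i,Others}(p)}{\sum_{i=1}^{p}\tau_i(p)} = \frac{(1 + 1 + 1 + 1)/5}{64.6} = 0.0124 \]

\[ E_p(p) = R_b(p)\cdot\frac{1}{R_p(p)}\cdot\bigl(1 - R_{Red}(p) - R_c(p) - R_{Others}(p)\bigr) = 0.9073\cdot\frac{1}{0.8821}\cdot(1 - 0.0960 - 0.7121 - 0.0124) = 0.1846 \]

\[ E_p(p) = R_b(p)\cdot\frac{1}{R_p(p)}\cdot\frac{\sum_{i=1}^{p}\gamma_i(p)}{\sum_{i=1}^{p}\tau_i(p)} = 0.9073\cdot\frac{1}{0.8821}\cdot\frac{58/5}{64.6} = 0.1847 \]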
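
The whole prediction can be reproduced from the measured times that appear in the calculation (parallel parts 15, 14, 13, 16; redundancy 8, 9, 7, 7; communication 10, 11, 12, 13; others 1 each). The sketch below is a minimal illustration with hypothetical variable names; it assumes, as the text does, that only the communication time is unaffected by the faster CPU and that the one-processor communication and others times are zero.

```python
speedup = 5.0
gamma = [15, 14, 13, 16]     # parallel calculation parts (old system)
red = [8, 9, 7, 7]           # redundancy processing
comm = [10, 11, 12, 13]      # communication (network-bound, unchanged)
others = [1, 1, 1, 1]        # other factors
p = len(gamma)

# CPU-bound parts shrink to 1/5 on the new system.
g = [x / speedup for x in gamma]
r = [x / speedup for x in red]
o = [x / speedup for x in others]

tau = [g[i] + r[i] + comm[i] + o[i] for i in range(p)]
total = sum(tau)

chi_red_1 = sum(r) / p       # one-processor redundancy time (averaging model)
tau_1 = sum(g) + chi_red_1   # one-processor time; comm and others taken as 0

r_b = total / (max(tau) * p)
r_p = sum(g) / tau_1
r_red, r_c, r_o = sum(r) / total, sum(comm) / total, sum(o) / total

e_p_from_ratios = r_b / r_p * (1.0 - r_red - r_c - r_o)
e_p_from_gamma = r_b / r_p * (sum(g) / total)
```

Without intermediate rounding the two routes give the same value, about 0.1847; the 0.1846 obtained by the first route in the hand calculation comes from using the four-digit rounded ratios.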

The above-described calculation results and performance indexes
based on the actual measurement are collected as shown in Fig. 17. As
shown in the table of Fig. 17, by predicting the parallel performance when
the system is replaced, the operational efficiency ESystem of a new system
can be estimated. For that purpose, log data of the previous system is
used, and all performance indexes of the previous processings are
calculated similarly to Fig. 17.

When a trial calculation is carried out to obtain the system
operational efficiency ESystem at the time when the CPU performance is
made five times higher, it becomes as follows. By comparing the ESystem
calculated on the basis of the estimated values with the ESystem calculated
from previous actual values, data with a more reliable basis as compared
with the working rate can be obtained for the replacement of the system.

As shown in Fig. 18, ESystem can be calculated in accordance with
Ep(4) of each processing in the case where the CPU performance becomes
five times higher, the sum of τi(p) with respect to i, and the number
of processors. Incidentally, the following condition is used as a premise.

\[ P_{System} = \text{[illegible]}, \quad \sum_{i=1}^{P_{System}} T_i = 4000, \quad k_{max} = 4 \]

\[ E_{System} = \frac{\sum_{k=1}^{k_{max}}\Bigl(E_p^{k}(p)\cdot\sum_{i=1}^{p}\tau_i^{k}(p)\Bigr)}{\sum_{i=1}^{P_{System}} T_i} = \frac{0.1846 \times 64.6 + 0.7219 \times 2000.3 + 0.3000 \times 512.1 + 1 \times 1000}{4000} = 0.6524 \]
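
The trial calculation is a weighted sum over the four processings. The sketch below takes the efficiency and processing-time pairs and the total working time of 4000 from the calculation above; the list layout and names are hypothetical.

```python
# (predicted parallel efficiency, sum of processing times) per processing
jobs = [
    (0.1846, 64.6),
    (0.7219, 2000.3),
    (0.3000, 512.1),
    (1.0, 1000.0),
]
total_working_time = 4000.0

e_system = sum(e_p * tau_sum for e_p, tau_sum in jobs) / total_working_time
```

Rounded to four digits the result is 0.6524.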



[Description of embodiments] 

Fig. 19 is a drawing showing a system outline of one embodiment
of the invention. A parallel performance analyzer 100 is a computer with
a single processor for analyzing parallel performance of a parallel
computer system 200, and is connected to an output device 110 such as
a printer or a display device. However, the parallel performance
analyzer 100 may be a parallel computer. The parallel performance
analyzer 100 includes a data acquisition unit 10, a load balance
contribution ratio calculator 11, a virtual parallelization ratio
calculator 12, a parallel performance impediment factor contribution
ratio calculator 13, a parallel efficiency calculator 14, an auxiliary
index calculator 15, a processor number optimizer 21, a processor add-on
estimation processor 22, a system replacement data processor 23, an
operational efficiency data processor 24, a tuning processor 25, an
algorithm selection processor 26, and a parallel performance evaluation
processor 27. The parallel performance analyzer 100 is connected to a
log data storage 30. The parallel computer system 200 includes a
measurement unit 201. For example, the parallel performance analyzer
100 is connected to the parallel computer system 200 through a network.

The measurement unit 201 of the parallel computer system 200
measures respective processing times γi(p), χi,j(p), and τi(p), while
executing a parallel processing in accordance with a program. For
example, a time from a start to an end of each processing is measured
by a timer, or a start time and an end time of each processing are recorded,
and a processing time is computed after the end of the processing. The
measurement of the time may be performed by software including the
operating system (OS) or by hardware. Data of measured processing times
is first stored in a memory of the parallel computer system 200, and/or
is stored in other storage devices according to circumstances.
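As an illustration of the timer-based variant, the following sketch accumulates the elapsed time per event between a recorded start and end. The class name, the event labels and the injectable clock are assumptions made for the example; the embodiment does not prescribe any particular software interface.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class ProcessingTimer:
    """Accumulates elapsed time per event label (for example 'gamma', 'chi_RED')."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock          # injectable so the timer can be tested
        self.totals = defaultdict(float)

    @contextmanager
    def measure(self, event):
        start = self._clock()        # start time of the processing
        try:
            yield
        finally:
            # The processing time is computed after the end of the processing.
            self.totals[event] += self._clock() - start


timer = ProcessingTimer()
with timer.measure("gamma"):
    sum(range(1000))                 # stands in for a parallel calculation
print(dict(timer.totals))
```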

Besides, there is also a case where instead of the measurement
of the processing times, events of a program under execution are
confirmed at predetermined time intervals, and the respective events
are counted. Such a measurement is called a measurement by sampling.
It becomes possible to adopt such a measurement by sampling since the
expressions (4-4), (9-1), and (9-2) and Rb(p), Rp(p) and Rj(p) have forms
of time ratios. Although there is a difference due to measurement
accuracy, the method by the time measurement and the method by the
sampling give the same result.

Fig. 20 is a conceptual view of the measurement by the sampling.
Fig. 20 shows a state in which time passes from the left to the right.
In Fig. 20, a downward arrow indicates timing of the sampling, and the
sampling is performed at predetermined time intervals as indicated by
the intervals between the downward arrows. In Fig. 20, after a redundancy
processing is first executed for χi,RED(p), a parallel calculation is
carried out for γi(p). Incidentally, the processing is executed for τi(p)
in total. The number of times of sampling is seven in the event of the
redundancy processing continuing for χi,RED(p), and nine in the event of
the parallel calculation continuing for γi(p). In the whole processing
time τi(p), the number of times of sampling is 22. In the parallel
performance impediment factors, events other than the intentionally
measured χi,RED(p) are collectively expressed by χi,others(p), and it is
calculated from the expression (13-6) using χi,RED(p) and γi(p). In the
example of Fig. 20, it is understood that the number of times of sampling
during χi,others(p) is six (= 22 - 9 - 7).
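The count for the remaining events in Fig. 20 is plain subtraction, and because the indexes are time ratios the counts can be used directly in place of times. A minimal sketch, with hypothetical function names:

```python
def complete_counts(total_samples, measured):
    """Add an 'others' entry holding the samples not attributed to any event."""
    rest = total_samples - sum(measured.values())
    if rest < 0:
        raise ValueError("measured counts exceed the total number of samples")
    counts = dict(measured)
    counts["others"] = rest
    return counts


def as_ratios(counts, total_samples):
    """Convert sample counts into fractions of the whole processing time."""
    return {name: n / total_samples for name, n in counts.items()}


counts = complete_counts(22, {"RED": 7, "gamma": 9})
ratios = as_ratios(counts, 22)
print(counts["others"], ratios)
```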

The summary of how to actually carry out the measurement by the
sampling will be described below.
(1) Portion of τi(p)

(a) A flag for an event τi(p) is turned ON at a start of a processing,
and is turned OFF at an end of the processing. At the time of execution,
it is discriminated at predetermined time intervals whether the flag
for the event τi(p) is ON/OFF, and the number of times that it is
discriminated that the flag is ON is counted to obtain the number of
times of sampling.

A description and a processing in the following methods are
combined as the need arises, and a measurement is made.

• A programmer detects a start and an end of a processing in a
program, that is, a position where the flag is to be turned ON/OFF, and
gives a description for turning the flag ON/OFF.

• In the case where a parallel language extension, a compiler
directive or the like is used, a tool interprets the parallel language
extension, the compiler directive or the like, and gives a description
for turning the flag ON/OFF.

• In the case where a parallel language extension, a compiler
directive or the like is used, a compiler interprets the parallel
language extension, the compiler directive or the like, and gives a
description for turning the flag ON/OFF.

• A compiler detects a start and an end of a processing in a program,
that is, a position where the flag is to be turned ON/OFF, and gives
a description for turning the flag ON/OFF.

• An OS detects a start and an end of a processing in a program,
that is, a position where the flag is to be turned ON/OFF, and gives
a description for turning the flag ON/OFF.

• A runtime library detects a start and an end of a processing
in a program, that is, a position where the flag is to be turned ON/OFF,
and gives a description for turning the flag ON/OFF.

• Hardware detects a start and an end of a processing in a program,
that is, a position where the flag is to be turned ON/OFF, and gives
a description for turning the flag ON/OFF.

• A description for a processing of discriminating that the flag
is ON and counting the number of times is given at a compiler level.

• A description for a processing of discriminating that the flag
is ON and counting the number of times is given at an OS level.

• A description for a processing of discriminating that the flag
is ON and counting the number of times is given at a runtime library
level.

• A description for a processing of discriminating that the flag
is ON and counting the number of times is given at a hardware level.

• A description for a processing of discriminating that the flag
is ON and counting the number of times is given at a tool level.

• A description for a processing of discriminating that the flag
is ON and counting the number of times is given at a program level.

• A processing of discriminating that the flag is ON and counting
the number of times is executed at a hardware level.
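Whichever layer turns the flags ON/OFF and performs the counting, the mechanism itself is small. The following sketch simulates it deterministically: a schedule says at which sampling tick a flag changes, and every tick increments the count of each event whose flag is ON. The class, the schedule format and the event labels are illustrative assumptions; a real implementation would be driven by a timer interrupt or a similar periodic source.

```python
class FlagSampler:
    """Counts, per event, how many sampling ticks found the event's flag ON."""

    def __init__(self):
        self.on_flags = set()
        self.counts = {}

    def turn_on(self, event):
        self.on_flags.add(event)

    def turn_off(self, event):
        self.on_flags.discard(event)

    def sample(self):
        # Called at predetermined intervals: count every flag that is ON.
        for event in self.on_flags:
            self.counts[event] = self.counts.get(event, 0) + 1


def run_timeline(schedule, n_ticks):
    """schedule maps a tick index to a list of ('on' | 'off', event) actions."""
    sampler = FlagSampler()
    for tick in range(n_ticks):
        for action, event in schedule.get(tick, ()):
            if action == "on":
                sampler.turn_on(event)
            else:
                sampler.turn_off(event)
        sampler.sample()
    return sampler.counts


# Timeline loosely modelled on Fig. 20: redundancy first, then parallel work.
schedule = {
    0: [("on", "tau"), ("on", "RED")],
    7: [("off", "RED"), ("on", "gamma")],
    16: [("off", "gamma")],
}
counts = run_timeline(schedule, n_ticks=22)
print(counts)
```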

(b) An event is specified by a program name or an execution module
name substituting for that, and at the time of execution, the program
name or the execution module name is discriminated at predetermined time
intervals, and the number of times that the name is discriminated is
counted to obtain the number of times of sampling.
A name creation method in the following methods, a discrimination
processing and a count processing are combined as the need arises, and
a measurement is made.

• A compiler creates the program name or the execution module
name.

• An OS creates the program name or the execution module name.

• A runtime library creates the program name or the execution
module name.

• Hardware creates the program name or the execution module name.

• The program name or the execution module name is created by a
description of a parallel language extension or a compiler directive.

• The program name or the execution module name is created by a
description of a programmer.

• A description for a discrimination processing of the created
program name or execution module name and a count processing is given
at a compiler level.

• A description for a discrimination processing of the created
program name or execution module name and a count processing is given
at an OS level.

• A description for a discrimination processing of the created
program name or execution module name and a count processing is given
at a runtime library level.

• A discrimination processing of the created program name or
execution module name and a count processing is executed at a hardware
level.

• A description for a discrimination processing of the created
program name or execution module name and a count processing is given
at a tool level.

• A description for a discrimination processing of the created
program name or execution module name and a count processing is given
at a program level.

• A discrimination processing of the created program name or
execution module name and a count processing is executed at a hardware
level.

(2) Portion of χi,j(p) and γi(p)
(a) Each time an event χi,j(p) or γi(p) appears, a flag for that is
turned ON at the start of the processing, and the flag for that is set
OFF at the end of the processing.
It is assumed that at the time of execution, it is discriminated
at predetermined time intervals whether a flag for each event is ON/OFF,
and the number of times that it is discriminated that the flag is ON
is counted to obtain the number of times of sampling. Since there is
a case where the events χi,j(p) and γi(p) cannot be detected by one method,
a description and a processing in the following methods are combined
as the need arises and a measurement is made.
• A programmer detects the start and the end of a processing in
a program, that is, a position where the flag is to be turned ON/OFF,
and gives a description for turning the flag ON/OFF.

• In the case where a parallel language extension or a compiler
directive is used, a tool interprets the parallel language extension
or the compiler directive, and gives a description for turning the flag
ON/OFF.

• In the case where a parallel language extension or a compiler
directive is used, a compiler interprets the parallel language extension
or the compiler directive, and gives a description for turning the flag
ON/OFF.

• The compiler detects a start and an end of a processing in a
program, that is, a position where the flag is to be turned ON/OFF, and
gives a description for turning the flag ON/OFF.

• An OS detects a start and an end of a processing in a program,
that is, a position where the flag is to be turned ON/OFF, and gives
a description for turning the flag ON/OFF.

• A runtime library detects a start and an end of a processing
in a program, that is, a position where the flag is to be turned ON/OFF,
and gives a description for turning the flag ON/OFF.

• Hardware detects a start and an end of a processing in a program,
that is, a position where the flag is to be turned ON/OFF, and gives
a description for turning the flag ON/OFF.

• A description for a processing of discriminating that the flag
is ON and counting the number of times is given at a compiler level.

• A description for a processing of discriminating that the flag
is ON and counting the number of times is given at an OS level.

• A description for a processing of discriminating that the flag
is ON and counting the number of times is given at a runtime library
level.

• A description for a processing of discriminating that the flag
is ON and counting the number of times is given at a hardware level.

• A description for a processing of discriminating that the flag
is ON and counting the number of times is given at a tool level.

• A description for a processing of discriminating that the flag
is ON and counting the number of times is given at an application program
level.

• A processing of discriminating that the flag is ON and counting
the number of times is executed at a hardware level.

(b) Known module names are previously classified into a parallel
processing part or a processing part relating to a parallel performance
impediment factor, the module names are discriminated at the time of
execution, and discriminations of the respective module names are counted
to obtain the number of times of sampling. A classifying method set forth
below, a discrimination processing and a count processing are combined
as the need arises, and a measurement is made.

• A classification of module names is made at a compiler level. 

• A classification of module names is made at an OS level. 

• A classification of module names is made at a runtime library
level.

• A classification of module names is made at a hardware level.

• A classification of module names is made at a parallel language 
extension or compiler directive level. 

• A classification of module names is made at a user level. 

• A description for a discrimination processing of the module name 
count processing is given at a compiler level. 

• A description for a discrimination processing of the module name 
count processing is given at an OS level. 

• A description for a discrimination processing of the module name 
count processing is given at a runtime library level. 

• A description for a discrimination processing of the module name 
count processing is given at a hardware level. 

• A description for a discrimination processing of the module name 
count processing is given at a tool level. 

• A description for a discrimination processing of the module name 
count processing is given at a program level. 

• A description for a discrimination processing of the module name 
count processing is given at a hardware level. 
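The module-name variant can be sketched in the same spirit: a table prepared beforehand classifies known module names, and each sampled name is counted under its class. The module names and the table below are hypothetical; only MPI_ALLTOALL appears in this description, and names that are not in the table are counted here as "others" as an assumption of the sketch.

```python
from collections import Counter

# Hypothetical classification prepared before execution.
MODULE_CLASS = {
    "MPI_ALLTOALL": "C",        # communication
    "gsum_reduce": "RED",       # redundancy processing
    "solver_kernel": "gamma",   # parallel processing part
}


def count_by_class(sampled_modules, module_class, default="others"):
    """Count sampled module names under their pre-assigned class."""
    return Counter(module_class.get(name, default) for name in sampled_modules)


samples = ["solver_kernel", "solver_kernel", "MPI_ALLTOALL",
           "gsum_reduce", "io_flush", "solver_kernel"]
class_counts = count_by_class(samples, MODULE_CLASS)
print(dict(class_counts))
```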

As an example, by using a program which describes a processing
of adding respective elements of F(Imax) existing in all processors
in Fortran and MPI (Message Passing Interface) of a parallel library,
a sampling method of χi,j(p) will be described in following table 1. For
example, an ON flag is expressed by "*sampon" to give an instruction
to a compiler, and an OFF flag is expressed by "*sampoff". Besides, RED
denotes a redundancy processing, C denotes communication, and a numeral
after RED or C denotes appearing order. Since a portion of a summation
processing is a redundancy processing in which the same calculation is
carried out in the respective processors, "*sampon (RED), 2" and
"*sampoff (RED), 2" are arranged at the start point and the end point,
and the flag is turned ON/OFF. Since calculation of a variable of nLOCAL
is also a redundancy processing, "*sampon (RED), 1" and "*sampoff (RED),
1" are arranged at the start point and the end point, and the flag is
turned ON/OFF. Further, MPI_ALLTOALL is a communication library, and
here, "*sampon (C), 1", "*sampoff (C), 1", "*sampon (C), 2" and "*sampoff
(C), 2" are arranged, and the flags are turned ON/OFF. Incidentally,
in the case of MPI_ALLTOALL, it is also possible for a tool, a compiler
or an OS to discriminate an event and to set a flag.
[Table 1] 

      subroutine GSUM(Imax, F, FW, NP)
      real*8 F(Imax), FW(Imax)
      include 'mpif.h'

*sampon (RED), 1
      nLOCAL = (Imax + NP - 1)/NP
*sampoff (RED), 1

*sampon (C), 1
      call MPI_ALLTOALL(F, nLOCAL, MPI_DOUBLE_PRECISION,
     &                  FW, nLOCAL, MPI_DOUBLE_PRECISION,
     &                  MPI_COMM_WORLD, IERR)
*sampoff (C), 1

*sampon (RED), 2
      do j = 2, NP
        k = (j - 1)*nLOCAL
        do i = 1, nLOCAL
          FW(i) = FW(i) + FW(i + k)
        end do
      end do

      do j = 2, NP
        k = (j - 1)*nLOCAL
        do i = 1, nLOCAL
          FW(i + k) = FW(i)
        end do
      end do
*sampoff (RED), 2

*sampon (C), 2
      call MPI_ALLTOALL(FW, nLOCAL, MPI_DOUBLE_PRECISION,
     &                  F, nLOCAL, MPI_DOUBLE_PRECISION,
     &                  MPI_COMM_WORLD, IERR)
*sampoff (C), 2
      return
      end
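To make the data movement of table 1 easier to follow, the sketch below imitates it in plain Python without any MPI library: a list of lists stands in for the arrays held by the NP processes, and a helper function imitates the block exchange of MPI_ALLTOALL. It assumes that Imax is divisible by NP, and the function names are hypothetical. The comments mark which parts correspond to the (C) and (RED) sections of the listing.

```python
def alltoall(send, n_local):
    """Imitate MPI_ALLTOALL: block s of rank r becomes block r of rank s."""
    n_proc = len(send)
    recv = [[0.0] * (n_local * n_proc) for _ in range(n_proc)]
    for r in range(n_proc):
        for s in range(n_proc):
            recv[s][r * n_local:(r + 1) * n_local] = \
                send[r][s * n_local:(s + 1) * n_local]
    return recv


def gsum(f):
    """f[r] is the array F of rank r; returns every rank's F after GSUM."""
    n_proc = len(f)
    imax = len(f[0])
    n_local = (imax + n_proc - 1) // n_proc           # (RED), 1
    if n_local * n_proc != imax:
        raise ValueError("this sketch assumes Imax is divisible by NP")
    fw = alltoall(f, n_local)                          # (C), 1
    for w in fw:                                       # (RED), 2 on each rank
        for j in range(1, n_proc):
            k = j * n_local
            for i in range(n_local):
                w[i] += w[i + k]                       # sum the received blocks
        for j in range(1, n_proc):
            k = j * n_local
            for i in range(n_local):
                w[i + k] = w[i]                        # replicate the partial sum
    return alltoall(fw, n_local)                       # (C), 2


result = gsum([[1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]])
print(result)
```

After the second exchange every rank holds the element-wise sum of all ranks' original arrays.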

Fig. 21 shows an example of sampling when the program of table
1 is executed. In Fig. 21, a count is made for each of events "(RED),
1", "(RED), 2", "(C), 1", and "(C), 2". However, when a parallel
efficiency or the like is calculated, they are handled as total
redundancy processing "(RED), 1 + (RED), 2", and total communication
processing "(C), 1 + (C), 2".

Returning to the description of Fig. 19, the data acquisition unit
10 of the parallel performance analyzer 100 acquires respective
processing times γi(p), χi,j(p), and τi(p), which are measured as a
processing time or a sampling number by the measurement unit 201 as
described above, from the parallel computer system 200, and stores them
in the log data storage 30 connected to the parallel performance analyzer
100. In addition to the respective processing times, data such as
parallel performance evaluation indexes including a calculated parallel
efficiency are also stored in the log data storage 30. As described above,
the load balance contribution ratio calculator 11 calculates the load
balance contribution ratio Rb(p) in accordance with the expression (5)
and stores it in the storage device. Incidentally, τ(p) is calculated
in accordance with the expression (2). The virtual parallelization ratio
calculator 12 calculates the virtual parallelization ratio Rp(p) in
accordance with the expression (6-1) and stores it in the storage device.
Incidentally, with respect to τ(1) of the denominator of the expression
(6-1), there is also a case where an approximation as indicated in the
expression (8-1) is performed. Besides, there is also a case where the
expression (10) and expressions (11) to (15) are used. Incidentally,
with respect to the second term (χi,j(1), j = 1) in the expression (15),
there is also a case where a calculation is carried out by one of the
expressions (12-1), (12-2), (12-3) and (12-4). Besides, with respect
to j > 1, there is also a case where a calculation is carried
out by one of the expressions (13-3) to (13-5). Here, j = 1 of χi,j(p) is
the redundancy processing. However, it may be a processing time concerning
another parallel performance impediment factor.

The parallel performance impediment factor contribution ratio
calculator 13 calculates the parallel performance impediment factor
contribution ratio Rj(p) concerning each parallel performance
impediment factor in accordance with the expression (7) and stores it
in the storage device. The parallel efficiency calculator 14 computes
the parallel efficiency Ep(p) in accordance with one of the expressions
(8-2), (9-1) and (9-2) in the case where the conditions of the expression
(4-4), (4-5) or (8-1) are satisfied, and stores it in the storage device.
In the case where the expression (9-2) is used, there is also a case
where the first term of the numerator is calculated by one of the
expressions (12-1), (12-2), (12-3), (12-4), (13-3), (13-4) and (13-5).
The expressions (12-1) to (12-4) concern the redundancy processing of
j = 1. The auxiliary index calculator 15 calculates the acceleration
ratio Ap in accordance with, for example, the expression (6-2) and
calculates Ep(p)·p from the parallel efficiency Ep(p) and the
number p of processors, and stores them in the storage device.

The processor number optimizer 21 executes a processing to
indicate to an end user of the parallel computer system 200 the optimum
number of processors to be applied for processing. The processor add-on
estimation processor 22 executes a processing for indicating to an
operational administrator of the parallel computer system 200 a
numerical value as a guideline at the add-on of the processor. The system
replacement data processor 23 executes a processing of indicating to
the operational administrator of the parallel computer system 200 a
numerical value as a guideline at the system replacement. The
operational efficiency data processor 24 executes a processing for
indicating to the administrator of the parallel computer system 200 data
concerning system operational efficiency. The tuning processor 25
executes a processing for enabling a programmer to execute effective
tuning by suitable performance objective setting or the like to a program
for performing a parallel processing. The algorithm selection processor
26 executes a processing for enabling a programmer to select an algorithm
capable of improving parallel efficiency with respect to a program for
performing a parallel processing in the case where different algorithms
exist for the same processing. The parallel performance evaluation
processor 27 executes a processing for enabling a developer or a
researcher of a parallel computer system to easily evaluate parallel
performance. The detailed processing contents of these processing units
will be described below.

Next, a processing flow of the system or the like shown in Fig.
19 will be described by using Fig. 22. At first, a pre-processing is
executed which includes a description for direct measurement of a
processing time, a description for turning ON/OFF a flag for counting
the number of times of sampling corresponding to each processing time
by a compiler, an OS, a tool, a programmer, a runtime library, hardware
or the like, and/or a classification of a module name and the like for
counting the number of times of sampling corresponding to each
processing time by the compiler, the OS, the tool, the programmer, the
runtime library, the hardware or the like (step S1). There is a case
where this processing is performed in the parallel computer system 200
or is performed in another computer system. Further, there is also a
case where a person such as a programmer performs it. Incidentally, since
there is also a case where the step S1 is not a processing executed in
the parallel performance analyzer 100 and is not a processing executed
in the parallel computer system 200, it is indicated by a block of a
dotted line.

Next, the measurement unit 201 of the parallel computer system
200 executes a measurement processing to make a measurement of a
processing time or a measurement processing to count the number of times
of sampling on the basis of the pre-processing (step S3). The respective
processing times γi(p), χi,j(p), and τi(p) as measurement results, or count
values of sampling corresponding to the respective processing times are
stored in the storage device of the parallel computer system 200, and
are read out by the data acquisition unit 10 of the parallel performance
analyzer 100. When acquiring the respective processing times γi(p),
χi,j(p), and τi(p) or the count values of sampling corresponding to the
respective processing times, the data acquisition unit 10 stores them
in the log data storage 30 of the parallel performance analyzer 100.
Then, the load balance contribution ratio calculator 11, the
virtual parallelization ratio calculator 12, the parallel performance
impediment factor contribution ratio calculator 13, the parallel
efficiency calculator 14, and the auxiliary index calculator 15 use the
respective processing times γi(p), χi,j(p), and τi(p) stored in the log
data storage 30, or the count values of sampling corresponding to the
respective processing times to calculate the load balance contribution
ratio Rb(p), the virtual parallelization ratio Rp(p), the respective
parallel performance impediment factor contribution ratios Rj(p), the
parallel efficiency Ep(p), the acceleration ratio Ap and other auxiliary
indexes, and store them in the log data storage 30 (step S5).

As described above, in the case where the conditions of the
expression (4-4), (4-5), or (8-1) are satisfied, the parallel efficiency
Ep(p) is calculated in accordance with one of the expressions (8-2),
(9-1) and (9-2). Accordingly, the parallel efficiency calculator 14
calculates the parallel efficiency Ep(p) by using the load balance
contribution ratio Rb(p) calculated by the load balance contribution
ratio calculator 11, the virtual parallelization ratio Rp(p) calculated
by the virtual parallelization ratio calculator 12, the respective
parallel performance impediment factor contribution ratios Rj(p)
calculated by the parallel performance impediment factor contribution
ratio calculator 13, and the acceleration ratio Ap(p) calculated by the
auxiliary index calculator 15, and by using the processing times and the
like stored in the log data storage 30 for other necessary data.

For example, a calculation example in the case where measurement
results of processing times as shown in Fig. 23 are stored in the log
data storage 30 will be described below. More specifically, it is
assumed that measurement results of τ1(p) = 34, τ2(p) = 35, τ3(p) = 33,
τ4(p) = 37, γ1(p) = 15, γ2(p) = 14, γ3(p) = 13, γ4(p) = 16, χ1,RED(p) = 8,
χ2,RED(p) = 9, χ3,RED(p) = 7, χ4,RED(p) = 7, χ1,C(p) = 10, χ2,C(p) = 11,
χ3,C(p) = 12, χ4,C(p) = 13 are obtained. Accordingly, χ1,others(p) = 1
(= 34 - 15 - 8 - 10), χ2,others(p) = 1 (= 35 - 14 - 9 - 11), χ3,others(p)
= 1 (= 33 - 13 - 7 - 12), and χ4,others(p) = 1 (= 37 - 16 - 7 - 13).
Although necessary for the following calculation, it is assumed that both
χ1,C(1) and χ1,others(1), which have not been measured, are 0.
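(1) Load balance contribution ratio (expression (5))

\[ R_b(p) = \frac{\sum_{i=1}^{p} \tau_i(p)}{\tau(p) \cdot p} = \frac{34 + 35 + 33 + 37}{37 \cdot 4} = \frac{139}{148} = 0.9392 \]

(2) Virtual parallelization ratio (expressions (6-1), (12-1), (15))

\[ \chi_{1,RED}(1) = \frac{1}{p} \sum_{i=1}^{p} \chi_{i,RED}(p) = \frac{1}{4}(8 + 9 + 7 + 7) = 7.75 \]

\[ \tau(1) = \sum_{i=1}^{p} \gamma_i(p) + \sum_{j=1}^{j_{others}} \chi_{1,j}(1) = (15 + 14 + 13 + 16) + 7.75 = 65.75 \]

\[ R_p(p) = \frac{\sum_{i=1}^{p} \gamma_i(p)}{\tau(1)} = \frac{15 + 14 + 13 + 16}{65.75} = \frac{58}{65.75} = 0.8821 \]

(3) Acceleration ratio (expression (6-2))

\[ A_p(p) = \frac{1}{1 - R_p(p)} = \frac{1}{1 - 0.8821} = 8.482 \]

(4) Parallel performance impediment factor contribution ratio
(expression (7))

(4-1) Parallel performance impediment factor contribution ratio of
redundancy processing

\[ R_{RED}(p) = \frac{\sum_{i=1}^{4} \chi_{i,RED}(p)}{\sum_{i=1}^{4} \tau_i(p)} = \frac{8 + 9 + 7 + 7}{139} = 0.2230 \]

(4-2) Parallel performance impediment factor contribution ratio of
communication processing

\[ R_{C}(p) = \frac{\sum_{i=1}^{p} \chi_{i,C}(p)}{\sum_{i=1}^{p} \tau_i(p)} = \frac{10 + 11 + 12 + 13}{139} = 0.3309 \]

(4-3) Other parallel performance impediment factor contribution ratio

\[ R_{others}(p) = \frac{\sum_{i=1}^{p} \chi_{i,others}(p)}{\sum_{i=1}^{p} \tau_i(p)} = \frac{1 + 1 + 1 + 1}{139} = 0.0288 \]

(5-1) Parallel efficiency (expression (4-4))

\[ E_p(p) = R_b(p) \cdot \frac{1}{R_p(p)} \cdot \left(1 - R_{RED}(p) - R_{C}(p) - R_{others}(p)\right) = 0.9392 \cdot \frac{1}{0.8821} \cdot (1 - 0.2230 - 0.3309 - 0.0288) = 0.4443 \]

(5-2) Parallel efficiency (expression (9-1))

\[ E_p(4) = R_b(4) \cdot \frac{1}{R_p(4)} \cdot \frac{\sum_{i=1}^{p} \gamma_i(p)}{\sum_{i=1}^{p} \tau_i(p)} = 0.9392 \cdot \frac{1}{0.8821} \cdot \frac{58}{139} = 0.4443 \]

(5-3) Parallel efficiency (expression (9-2))

\[ E_p(p) = \frac{\chi_{1,RED}(1) + \sum_{i=1}^{p} \gamma_i(p)}{\tau(p) \cdot p} = \frac{7.75 + 58}{37 \cdot 4} = 0.4443 \]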
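The whole chain of indexes for the Fig. 23 data can be reproduced with a short function. This is a sketch under the same assumptions as the worked example: the one-processor redundancy time is the average over the processors (expression (12-1)), and the one-processor times of the other impediment factors are 0. The function name and the dictionary layout are illustrative. The three forms of the parallel efficiency are computed separately to show that they agree.

```python
def parallel_indexes(gamma, chi):
    """gamma: per-processor parallel times.
    chi: {factor name: per-processor times}; chi['RED'] is the redundancy time."""
    p = len(gamma)
    tau = [gamma[i] + sum(v[i] for v in chi.values()) for i in range(p)]
    total = sum(tau)
    rb = total / (max(tau) * p)                       # expression (5)
    chi_red_1 = sum(chi["RED"]) / p                   # expression (12-1)
    tau_1 = sum(gamma) + chi_red_1                    # expression (15)
    rp = sum(gamma) / tau_1                           # expression (6-1)
    ap = 1.0 / (1.0 - rp)                             # expression (6-2)
    rj = {name: sum(v) / total for name, v in chi.items()}   # expression (7)
    return {
        "Rb": rb, "Rp": rp, "Ap": ap, "Rj": rj,
        "Ep_4_4": rb / rp * (1.0 - sum(rj.values())),
        "Ep_9_1": rb / rp * sum(gamma) / total,
        "Ep_9_2": tau_1 / (max(tau) * p),
    }


fig23 = parallel_indexes(
    gamma=[15, 14, 13, 16],
    chi={"RED": [8, 9, 7, 7], "C": [10, 11, 12, 13], "others": [1, 1, 1, 1]},
)
for key in ("Rb", "Rp", "Ap", "Ep_4_4", "Ep_9_1", "Ep_9_2"):
    print(f"{key} = {fig23[key]:.4f}")
```

The acceleration ratio printed by the sketch differs slightly from 8.482 because the worked example uses Rp rounded to four decimal places before taking the reciprocal.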



The above results are collected as shown in Fig. 13. Incidentally,
Ep(p)·p as an auxiliary index is also calculated. It is understood from
Ep(4)·p = 1.777 that the processing is performed in the four-processor
parallel with the performance of 1.777 processors. The parallel
efficiency is lowered to about 94% (= Rb(4) = 0.9392) by the load balance.
The influences of the parallel performance impediment factors are about
22% (= RRED(4) = 0.2230) through the redundancy processing, about 33%
(= RC(4) = 0.3309) through the communication, and about 3% (= Rothers(4)
= 0.0288) through the others. Accordingly, the parallel efficiency is
lowered by about 55% mainly through the communication and the redundancy
processing execution. Besides, in Fig. 13, from Rp(4) = 0.8821, it can
be estimated that the parallel maximum performance at the time when
processors are infinitely applied is 8.482 (= Ap(4) = 1/(1 - 0.8821))
times that of one processor. Accordingly, it is understood
that this processing should be performed by eight processors or less.
If this processing is performed under the condition of Ep(x) > 0.8, then
Ep(p)·p = 0.4443 × 4 = 1.777 = Ep(x)·x = 0.8·x, and x = 2.22. Accordingly,
it is expected that p = 2.22 ≈ 2 is the optimum number of processors
for the given condition.



Besides, a calculation example in the case where, for example,
count numbers by sampling as shown in Fig. 24 are stored in the log data
storage 30 by the data acquisition unit 10, will be described below.
More specifically, it is assumed that measurement results of τ1(p) = 3488,
τ2(p) = 3561, τ3(p) = 3372, τ4(p) = 3756, γ1(p) = 1521, γ2(p) = 1411, γ3(p)
= 1322, γ4(p) = 1601, χ1,RED(p) = 823, χ2,RED(p) = 945, χ3,RED(p) = 711,
χ4,RED(p) = 703, χ1,C(p) = 1056, χ2,C(p) = 1111, χ3,C(p) = 1230, χ4,C(p)
= 1341 are obtained. Accordingly, χ1,others(p) = 88 (= 3488 - 1521 - 823
- 1056), χ2,others(p) = 94 (= 3561 - 1411 - 945 - 1111), χ3,others(p) = 109
(= 3372 - 1322 - 711 - 1230), χ4,others(p) = 111 (= 3756 - 1601 - 703
- 1341). It is assumed that both χ1,C(1) and χ1,others(1), which are needed
in the following calculation but have not been measured, are 0.
(1) Load balance contribution ratio Rb(p) (expression (5))

\[ R_b(p) = \frac{\sum_{i=1}^{4} \tau_i(p)}{\tau(4) \cdot 4} = \frac{(1521 + 823 + 1056 + 88) + (1411 + 945 + 1111 + 94) + (1322 + 711 + 1230 + 109) + (1601 + 703 + 1341 + 111)}{(1601 + 703 + 1341 + 111) \cdot 4} \]

\[ = \frac{3488 + 3561 + 3372 + 3756}{3756 \cdot 4} = \frac{14177}{15024} = 0.9436 \]

(2) Virtual parallelization ratio (expressions (6-1), (12-1), (15))

\[ \chi_{1,RED}(1) = \frac{1}{p} \sum_{i=1}^{p} \chi_{i,RED}(p) = \frac{1}{4}(823 + 945 + 711 + 703) = 795.5 \]

\[ R_p(p) = \frac{\sum_{i=1}^{p} \gamma_i(p)}{\tau(1)} = \frac{1521 + 1411 + 1322 + 1601}{(1521 + 1411 + 1322 + 1601) + 795.5} = \frac{5855}{5855 + 795.5} = 0.8804 \]

(3) Acceleration ratio (expression (6-2))

\[ A_p(p) = \frac{1}{1 - R_p(p)} = \frac{1}{1 - 0.8804} = 8.361 \]

(4) Parallel performance impediment factor contribution ratio
(expression (7))

(4-1) Parallel performance impediment factor contribution ratio of
redundancy processing

\[ R_{RED}(4) = \frac{\sum_{i=1}^{4} \chi_{i,RED}(p)}{\sum_{i=1}^{4} \tau_i(p)} = \frac{823 + 945 + 711 + 703}{14177} = 0.2244 \]

(4-2) Parallel performance impediment factor contribution ratio of
communication processing

\[ R_{C}(4) = \frac{\sum_{i=1}^{p} \chi_{i,C}(p)}{\sum_{i=1}^{p} \tau_i(p)} = \frac{1056 + 1111 + 1230 + 1341}{14177} = 0.3342 \]

(4-3) Other parallel performance impediment factor contribution ratio

\[ R_{others}(4) = \frac{\sum_{i=1}^{p} \chi_{i,others}(p)}{\sum_{i=1}^{p} \tau_i(p)} = \frac{88 + 94 + 109 + 111}{14177} = 0.0284 \]

(5-1) Parallel efficiency (expression (4-4))

\[ E_p(p) = R_b(p) \cdot \frac{1}{R_p(p)} \cdot \left(1 - R_{RED}(p) - R_{C}(p) - R_{others}(p)\right) = 0.9436 \cdot \frac{1}{0.8804} \cdot (1 - 0.2244 - 0.3342 - 0.0284) = 0.4426 \]

(5-2) Parallel efficiency (expression (9-1))

\[ E_p(p) = R_b(p) \cdot \frac{1}{R_p(p)} \cdot \frac{\sum_{i=1}^{p} \gamma_i(p)}{\sum_{i=1}^{p} \tau_i(p)} = 0.9436 \cdot \frac{1}{0.8804} \cdot \frac{5855}{14177} = 0.4426 \]

(5-3) Parallel efficiency (expression (9-2))

\[ E_p(p) = \frac{\chi_{1,RED}(1) + \sum_{i=1}^{p} \gamma_i(p)}{\tau(p) \cdot p} = \frac{795.5 + 5855}{3756 \cdot 4} = 0.4427 \]

Similarly to the case of the measurement of the processing times, 
the above results are collected as shown in Fig. 13. 
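Since the indexes are ratios, the same formula can be fed with measured times or with sample counts. The sketch below applies the form of expression (9-2) used in the two worked examples to both data sets; the function name is hypothetical, and the one-processor redundancy term is again taken as the average over the processors.

```python
def efficiency_9_2(gamma, red, tau):
    """Parallel efficiency in the form of expression (9-2).

    gamma, red, tau: per-processor parallel, redundancy and total values,
    given either as times or as sample counts.
    """
    p = len(gamma)
    chi_red_1 = sum(red) / p                 # expression (12-1)
    return (chi_red_1 + sum(gamma)) / (max(tau) * p)


by_time = efficiency_9_2([15, 14, 13, 16], [8, 9, 7, 7], [34, 35, 33, 37])
by_sampling = efficiency_9_2([1521, 1411, 1322, 1601],
                             [823, 945, 711, 703],
                             [3488, 3561, 3372, 3756])
print(f"by time: {by_time:.4f}, by sampling: {by_sampling:.4f}")
```

The two results differ only in the third decimal place, which reflects the measurement accuracy of the sampling.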

Returning to the description of Fig. 22, a set of measurement
results of processing times or the like and parallel performance
evaluation indexes and auxiliary indexes as shown in Fig. 13 are stored
in the log data storage 30 for each processing. Then, the parallel
performance analyzer 100 outputs a processing result as shown in Fig.
13 to the output device 110 such as a display device or a printer
according to a request of a user or automatically (step S7).

The user himself/herself may perform, based on only the data as
shown in Fig. 13, analysis concerning the parallel performance or the
like, estimation of the optimum number of processors, estimation of an
effect in the case where processor add-on or system replacement is
performed, tuning of a program etc., selection of an algorithm or the
like. However, various consulting support processings as described
below are executed in accordance with the instruction of the user by
the processor number optimizer 21, the processor add-on estimation
processor 22, the system replacement data processor 23, the operational
efficiency data processor 24, the tuning processor 25, the algorithm
selection processor 26, and the parallel performance evaluation
processor 27 (step S9). By this, more specific data can be obtained.
A. Processor number optimization processing 

A processing by the processor number optimizer 21 will be
described by using Figs. 25 and 26. The processor number optimizer 21
receives a setting input of a value of a target parallel efficiency
(Ep)T from the user (step S11). Then, the optimum number of processors
is calculated in accordance with the following expression and is
stored in the storage device (step S13).

$$(p)_{OPT} = \frac{E_p(p)}{(E_p)_T}\cdot p$$

Then, the calculated optimum number of processors is outputted
to the output device 110 (step S15). In this way, the user can keep the
number of processors used at the next execution of a processing
belonging to the same processing group to the necessary minimum. For
example, when the calculation result shown in Fig. 13 is obtained and
the target parallel efficiency is (Ep)T = 0.8, the number of processors
becomes p = 2.22. Accordingly, the optimum number of processors
becomes 2.
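
The expression for the optimum number of processors can be sketched in Python as follows. This is only an illustration: the function name is hypothetical, the measured value Ep(4) = 0.4443 is the one quoted later in this description for the processing times of Fig. 23, and rounding to the nearest integer is an assumption, since the text does not state the rounding rule.

```python
def optimum_processors(ep_measured, ep_target, p):
    """(p)OPT = Ep(p) / (Ep)T * p, returned before rounding."""
    if ep_target <= 0:
        raise ValueError("target parallel efficiency must be positive")
    return ep_measured / ep_target * p


# Ep(4) = 0.4443 is the value quoted for the Fig. 23 data; (Ep)T = 0.8.
p_opt = optimum_processors(0.4443, 0.8, 4)

# Assumption: round to the nearest integer and use at least one processor.
p_next = max(1, round(p_opt))
print(f"(p)OPT = {p_opt:.2f}, processors to use next = {p_next}")
```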

In addition, in the case where processings of the same processing
group are executed in succession, the processing can be executed more
effectively by adjusting the number of processors as the group
proceeds. That is, the processing shown in Fig. 26 is executed. First,
the number p of processors is tentatively set (step S21). This
tentatively set number p of processors is used for the first processing
of the same processing group. In addition, a setting of a target
parallel efficiency is accepted from the user (step S23). Then, in
accordance with the setting of the number p of processors, a parallel
processing is executed by the parallel computer system 200, processing
times and the like are measured by the measurement unit 201, and the
measurement result is stored in the storage device (step S25). The data
acquisition unit 10 stores the data, such as the processing times
measured by the measurement unit 201, into the log data storage 30.
Then, the parallel performance evaluation indexes including the
parallel efficiency are calculated by the parallel efficiency
calculator 14 and the like and are stored in the log data storage 30
(step S27).

Then, the processor number optimizer 21 calculates the optimum
number (p)OPT of processors and stores it in the storage device (step
S29). This calculated optimum number (p)OPT of processors is substituted
for p as the number of processors used for the next processing of the
same processing group (step S31). Then, it is judged whether or not all
processings of the same processing group have been executed (step S33).
In case not all processings have been executed, the next processing in
the same processing group is selected (step S35), the procedure returns
to step S25, and the parallel processing is executed with the number of
processors set at the step S31.

By executing such a processing, the optimum number of processors
for the previous processing belonging to the same processing group can
be set as the number of processors for the next processing, and
therefore, it becomes possible to perform the processing of the
processing group more effectively.
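
The flow of Fig. 26 can be sketched as a simple feedback loop. The sketch below is a minimal illustration under stated assumptions: `measure_ep` is a hypothetical callback standing in for the execution, measurement, and index calculation of steps S25 to S27, `toy_efficiency` is an invented model used only to exercise the loop, and rounding to the nearest integer is an assumption.

```python
def adjust_processors(jobs, measure_ep, p_initial, ep_target):
    """Return the number of processors used for each job of one group."""
    p = p_initial                       # step S21: tentative setting
    used = []
    for job in jobs:                    # steps S33/S35: next processing
        used.append(p)
        ep = measure_ep(job, p)         # steps S25-S27: run, measure, compute Ep
        p_opt = ep / ep_target * p      # step S29: (p)OPT = Ep(p)/(Ep)T * p
        p = max(1, round(p_opt))        # step S31: use (p)OPT next time
    return used


def toy_efficiency(job, p):
    """Invented efficiency model, for illustration only."""
    return 1.0 / (1.0 + 0.3 * (p - 1))


used = adjust_processors(["job1", "job2", "job3", "job4"],
                         toy_efficiency, p_initial=8, ep_target=0.8)
print(used)
```

With this toy model the number of processors settles after a few processings, which is the behaviour the repeated adjustment is intended to produce.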
B. Processor add-on estimation processing

The processor add-on estimation processor 22 executes a
processing for giving the acceleration ratio Asystem at the time of
system expansion as a quantitative guideline for the processor add-on
of the parallel computer system 200. Fig. 27 shows the processing flow.
First, the processor add-on estimation processor 22 accepts a setting
input of data on the increase of the working time at the time of the
system expansion and data on its predicted parallel efficiency (step
S41). Then, the acceleration ratio Asystem at the time of the system
expansion is calculated in accordance with the expression (18), and is
stored in the storage device (step S43). Incidentally, data such as the
working time of each processor presently in use are calculated by using
the past processing log data stored in the log data storage 30. Then,
the calculated acceleration ratio Asystem at the time of the system
expansion is outputted to the output device 110 such as the display
device (step S45).

From the set increase of the working time, the predicted parallel
efficiency, and the resulting acceleration ratio Asystem at the time of
the system expansion, it becomes possible to judge by how much the time
available for executing meaningful processing increases.
C. System replacement data processing

A processing for presenting a quantitative guideline for
determining the performance of a new parallel computer system is
executed at the replacement of the parallel computer system. Fig. 28
shows the processing flow. The system replacement data processor 23
receives a setting input of a target parallel efficiency (Ep)T and a
repeat count icmax (step S51). In addition, as the performance of the
new parallel computer system, a setting input of a performance
magnification A relative to the present parallel computer system is
accepted (step S53). With respect to the performance magnification, a
setting input of a magnification Acpu of CPU performance, a
magnification Ac of communication performance, a magnification AI/O of
I/O performance, and the like is accepted. A quantitative guideline is
obtained from these magnification values. In most computer system
replacements, the performance improvement of the system is planned
around the improvement of CPU performance; therefore, for example, Acpu
is set first, and Ep is calculated on the assumption that the other
performance magnifications are 1. Then, values of Ac, AI/O, and the
like are obtained by repeated calculations so as to achieve or approach
(Ep)T, which gives a guideline for determining the performance of the
new parallel computer system.

More specifically, the system replacement data processor 23
carries out a calculation to shorten the respective processing times
and the like stored in the log data storage 30 in accordance with the
respective set performance magnifications (step S55). For example, in
the case where the CPU performance is set to be five times higher
(Acpu = 5), a calculation to reduce the processing time γi(p) and the
like of the parallel processing to 1/5 is carried out. Then, on the
basis of the respective processing times shortened in accordance with
the respective set performance magnifications, estimated values of the
parallel performance evaluation indexes including the parallel
efficiency (for example, an estimated value (Ep)E of the parallel
efficiency) are calculated and stored in the storage device (step S57).

The system replacement data processor 23 judges whether or not
the estimated value (Ep)E of the parallel efficiency coincides with
the target parallel efficiency (Ep)T (step S59). Complete coincidence
is not necessarily required; it is judged whether or not the
estimated value (Ep)E falls within a predetermined range of the target
parallel efficiency (Ep)T. In a case where it is judged that the
estimated value (Ep)E of the parallel efficiency is almost coincident
with the target parallel efficiency (Ep)T, a message indicating the
achievement of the target parallel efficiency and the estimated values
of the respective parallel performance evaluation indexes calculated at
the step S57 are outputted to the output device 110 such as the display
device (step S61). On the other hand, in a case where the estimated
value (Ep)E of the parallel efficiency cannot be said to be almost
coincident with the target parallel efficiency (Ep)T, it is judged
whether the counter ic has reached the repeat count icmax (step S63).
In case the counter ic is equal to or more than the repeat count icmax,
a message indicating that the target parallel efficiency could not be
achieved, and the estimated values of the parallel performance
evaluation indexes calculated at the step S57, are outputted to the
output device 110 such as the display device (step S65).

On the other hand, in the case where ic is less than the repeat
count icmax, a change of the performance magnification, such as the
magnification of CPU performance, the magnification of communication
performance, or the magnification of I/O performance, is executed (step
S67). This step may be carried out automatically, or may accept a
setting by the user. Then, the counter ic is incremented by one (step
S69), and the procedure returns to the step S55.
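
The repeated estimation of steps S55 to S69 can be sketched as the loop below. This is a minimal sketch, not the implementation of the system replacement data processor 23: `estimate_ep` and `adjust` are hypothetical callbacks (the estimation of steps S55 to S57 and the change of magnifications at step S67), the tolerance `tol` stands in for the "predetermined range", and starting the counter ic at 0 is an assumption.

```python
def search_magnifications(estimate_ep, adjust, mags, ep_target, ic_max, tol=0.01):
    """Return (achieved, magnifications, estimated Ep)."""
    ic = 0                                    # assumption: counter starts at 0
    while True:
        ep = estimate_ep(mags)                # steps S55-S57
        if abs(ep - ep_target) <= tol:        # step S59: almost coincident?
            return True, mags, ep             # step S61
        if ic >= ic_max:                      # step S63
            return False, mags, ep            # step S65
        mags = adjust(mags, ep, ep_target)    # step S67
        ic += 1                               # step S69


# Toy callbacks for illustration only: Ep grows with the communication
# magnification "c", which is raised by one at each change.
def toy_estimate(mags):
    return min(1.0, 0.1 * mags["c"])


def toy_adjust(mags, ep, ep_target):
    return {"c": mags["c"] + 1}


ok, mags, ep = search_magnifications(toy_estimate, toy_adjust, {"c": 1}, 0.6, ic_max=10)
print(ok, mags, round(ep, 2))
```

Passing the adjustment as a callback mirrors the statement that step S67 may be automatic or may accept a setting by the user.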

In the above processing, in order to achieve the target parallel
efficiency (Ep)T, the performance magnification is changed up to the
maximum repeat count icmax, and an estimation of the parallel
efficiency is made. Incidentally, it is possible to select a new
parallel computer system having performance satisfying (Ep)T for a
specific processing the processing times of which are stored in the log
data storage 30, or to select a new parallel computer system having
performance satisfying (Ep)T for several kinds of processings.

An application example of the processing flow of Fig. 28 will be
described using the case where the processing times shown in Fig. 23
are measured. Here, the target parallel efficiency is set to
(Ep)T = 0.6, and when it is supposed that a new system having CPU
performance five times higher, that is, Acpu = 5, is introduced, γi(p)
and χi,RED(p) are reduced to 1/Acpu. It is assumed that χi,others(p),
whose property is unclear, is also reduced to 1/Acpu. On the other
hand, χi,C(p) depends on network performance. Here, first, the
possibility of realization will be considered on condition of Ac = ∞.
Incidentally, it is assumed that both χ1,C(1) and χ1,others(1), which
have not been measured, are 0.

A following calculation is carried out from the expression 
(12-1) . 

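$$\chi_{1,RED}(1) = \frac{1}{p}\sum_{i=1}^{p}\chi_{i,RED}(p) = \frac{1}{4}(8 + 9 + 7 + 7) = 7.75$$

[(Ep)E in the case of Acpu = 5 and Ac = ∞]

$$R_b(p) = \frac{\sum_{i=1}^{p}\tau_i(p)}{\tau(p)\cdot p} = \frac{\sum_{i=1}^{4}\bigl[(\gamma_i(p)+\chi_{i,RED}(p)+\chi_{i,others}(p))/A_{CPU} + \chi_{i,C}(p)/A_C\bigr]}{\tau(p)\cdot p}$$

$$= \frac{(15+8+1)/5 + 10/\infty + (14+9+1)/5 + 11/\infty + (13+7+1)/5 + 12/\infty + (16+7+1)/5 + 13/\infty}{\bigl((16+7+1)/5 + 13/\infty\bigr)\cdot 4} = \frac{(24+24+21+24)/5}{(24/5)\cdot 4} = \frac{18.6}{19.2} = 0.9688$$

$$R_p(p) = \frac{\sum_{i=1}^{p}\gamma_i(p)}{\tau(1)} = \frac{\sum_{i=1}^{p}\gamma_i(p)/A_{CPU}}{\sum_{i=1}^{p}\gamma_i(p)/A_{CPU} + \chi_{1,RED}(1)/A_{CPU}} = \frac{(15+14+13+16)/5}{(15+14+13+16)/5 + 7.75/5} = \frac{58}{58 + 7.75} = 0.8821$$

$$R_{RED}(p) = \frac{\sum_{i=1}^{4}\chi_{i,RED}(p)}{\sum_{i=1}^{4}\tau_i(p)} = \frac{\sum_{i=1}^{4}\chi_{i,RED}(p)/A_{CPU}}{\sum_{i=1}^{4}\bigl[(\gamma_i(p)+\chi_{i,RED}(p)+\chi_{i,others}(p))/A_{CPU} + \chi_{i,C}(p)/A_C\bigr]} = \frac{(8+9+7+7)/5}{18.6} = 0.3333$$

$$R_{others}(p) = \frac{\sum_{i=1}^{4}\chi_{i,others}(p)}{\sum_{i=1}^{4}\tau_i(p)} = \frac{\sum_{i=1}^{4}\chi_{i,others}(p)/A_{CPU}}{\sum_{i=1}^{4}\bigl[(\gamma_i(p)+\chi_{i,RED}(p)+\chi_{i,others}(p))/A_{CPU} + \chi_{i,C}(p)/A_C\bigr]} = \frac{(1+1+1+1)/5}{18.6} = 0.0430$$

$$E_p(p) = R_b(p)\cdot\frac{1}{R_p(p)}\cdot\bigl(1 - R_{RED}(p) - R_C(p) - R_{others}(p)\bigr) = 0.9688\cdot\frac{1}{0.8821}\cdot(1 - 0.3333 - 0 - 0.0430) = 0.6850$$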

The above calculation results are collected in Fig. 29. When the
magnification is Ac = ∞, the term of χi,C(p) becomes 0 and Ep = 0.6850,
which is larger than the target value 0.6. Accordingly, it is
understood that the target parallel efficiency may be achievable by
improving the performance Ac. Then, at step S67, while Ac is changed,
the calculation of the parallel efficiency is repeated (step S57) to
find the Ac which produces Ep(p) ≈ 0.6. Although the intermediate
calculation is omitted, the calculation result in the case of
Ep(p) ≈ 0.6 is shown at the second line of Fig. 29. In this case,
Ac = 19.2. From this, in the case where it is desired to achieve the
CPU performance of Acpu = 5 and (Ep)T = 0.6, there is obtained a
guideline indicating that it is proper to find a parallel computer
system having performance of Ac = 19.2 or more and to make a
replacement with such a system.
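
The estimate for this example can be reproduced with the short Python sketch below, which scales the Fig. 23 processing times quoted above by the magnifications and evaluates expression (4-4). The function and variable names are illustrative assumptions; the treatment of χi,others(p) (scaled by Acpu) follows the assumption stated in the text.

```python
import math

# Processing times of Fig. 23 as quoted in the worked example (p = 4).
GAMMA = [15, 14, 13, 16]   # gamma_i(p): parallel processing
X_RED = [8, 9, 7, 7]       # chi_i,RED(p): redundancy processing
X_C = [10, 11, 12, 13]     # chi_i,C(p): communication processing
X_OTH = [1, 1, 1, 1]       # chi_i,others(p)


def estimate_ep(a_cpu, a_c):
    """Estimated parallel efficiency (Ep)E for magnifications Acpu and Ac."""
    p = len(GAMMA)
    tau = [(g + r + o) / a_cpu + c / a_c
           for g, r, c, o in zip(GAMMA, X_RED, X_C, X_OTH)]
    total = sum(tau)
    r_b = total / (max(tau) * p)                    # load balance ratio
    x_red_1 = sum(X_RED) / p                        # expression (12-1)
    r_p = sum(GAMMA) / (sum(GAMMA) + x_red_1)       # Acpu cancels out
    r_red = (sum(X_RED) / a_cpu) / total
    r_c = sum(c / a_c for c in X_C) / total
    r_oth = (sum(X_OTH) / a_cpu) / total
    return r_b / r_p * (1.0 - r_red - r_c - r_oth)  # expression (4-4)


print(f"Ac = inf : {estimate_ep(5, math.inf):.4f}")
print(f"Ac = 19.2: {estimate_ep(5, 19.2):.4f}")
```

Small differences in the last digit relative to the text can arise because the text rounds the intermediate ratios to four digits before multiplying.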

Incidentally, in the case where Ac = 19.2 is too high and no such
system exists at present, a reduction of another parallel performance
impediment factor is considered. According to the estimation result at
the second line of Fig. 29, it is understood that the redundancy
processing, RRED(4) = 0.2953, should be improved. In order to reduce
the redundancy processing, tuning of the program becomes necessary.
The tuned program is executed, the processing of Fig. 28 is executed
again, and Ac is simply estimated again.

In addition, by predicting the parallel performance at the time of
system replacement as shown in Fig. 29, it is also possible to estimate
the system operational efficiency Esystem of a newly introduced
parallel computer system. For example, in the case where the
processings 1 to 4, the estimation of which is shown in Fig. 18, are
performed by a new system of Acpu = 5 and Ac = 19.2 in which the target
parallel efficiency (Ep)T = 0.6 is cleared for a certain processing, it
is understood that the efficiency becomes Esystem = 0.6534. By
comparing this predicted Esystem with the Esystem obtained from the
past processing log, it becomes possible to indicate quantitatively,
with better-grounded data, the improvement of the working rate due to
the replacement of the system.
D. System operational efficiency improvement processing 

On the basis of the index of the system operational efficiency
Esystem expressed by the expression (17), the operational efficiency of
a system is estimated. For the improvement of the index, a specific
guideline concerning the improvement of the operational efficiency is
given; for example, a guideline indicating which processing must have
its parallel efficiency improved, and to what extent. Specifically, the
processing of Fig. 30 is executed.

The operational efficiency data processor 24 accepts a setting
input of a repeat count icmax and a target value (Esystem)T of the
system operational efficiency by an operational administrator (step
S71). Then, data such as processing times and parallel efficiency
stored in the log data storage 30 are read out, and the system
operational efficiency Esystem is calculated in accordance with the
expression (17) and is stored in the storage device (step S73).
Incidentally, in the case where the parallel performance evaluation
indexes including the parallel efficiency have not yet been calculated,
they are calculated at this stage by the load balance contribution
ratio calculator 11, the virtual parallelization ratio calculator 12,
the parallel performance impediment factor contribution ratio
calculator 13, the parallel efficiency calculator 14, and the like.
Then, it is judged whether the system operational efficiency Esystem
calculated at step S73 exceeds the target value (Esystem)T of the
system operational efficiency (step S75). In case it is judged that
Esystem > (Esystem)T holds, a message indicating the achievement of the
object and the system operational efficiency Esystem computed at the
step S73 are outputted to the output device 110 such as the display
device (step S77).

On the other hand, in the case of Esystem ≤ (Esystem)T, it is judged
whether the counter value ic is the repeat count icmax or more (step
S79). If the counter value ic is the repeat count icmax or more, in
order to inform the user that the system operational efficiency
improvement processing is not functioning well, a message indicating
that the object is not achieved and the system operational efficiency
Esystem calculated at the most recent step S73 are outputted to the
output device 110 such as the display device (step S81).

On the other hand, if the counter value ic is less than the repeat
count icmax, the operational efficiency data processor 24 recommends
that an end user execute an improvement process for end users, that a
system administrator execute an improvement process for system
administrators, that a programmer execute an improvement process for
programmers, and that a parallel computer system developer or
researcher execute an improvement process for parallel computer system
developers or researchers; the end user or the like then executes
whichever system operational efficiency improvement process is possible
(step S83). Incidentally, examples of the processes to be executed
include optimization of the number of processors, add-on of a
processor, system replacement, tuning of a program, and the like. After
the execution of the system operational efficiency improvement process,
the parallel processing is again executed by the parallel computer
system 200, and at the same time, a measurement processing of
processing times and the like is executed by the measurement unit 201
(step S85). Then, the counter value ic is incremented by one (step
S87), and the procedure returns to the step S73. Incidentally, since
the step S83 may be a processing performed by the end user, it is
indicated by a dotted-line block, and since the step S85 is not a
processing of the parallel performance analyzer 100, it is indicated by
a block drawn with an alternate long and short dash line.

By executing such a processing, it becomes possible to improve
the system operational efficiency with consideration given to the
parallel efficiency, which has not been considered in the conventional
working rate NWRsystem, that is, with consideration given to the
effective processing time.
E. Tuning processing

Conventionally, with respect to the performance improvement
operation by tuning of a parallel application program, since the
achievement object is unclear, the estimation of its working time has
not been easy. There are also cases where a target processing time
cannot be attained by tuning, and there are many cases where the tuning
operation is continued endlessly and a lot of working time is spent.
To address this, the processing shown in Fig. 31 is executed.

First, the tuning processor 25 accepts a setting input of a target
processing time (τ)T, a repeat count icmax, and a limit parallel
efficiency (Ep)max by a programmer (step S91). Next, by using data of
the parallel efficiency and the processing time stored in the log data
storage 30 (for example, the parallel efficiency and the processing
time included in the processing log of the program to be tuned), the
target parallel efficiency (Ep)T corresponding to the target processing
time (τ)T is calculated and is stored in the storage device (step S93).
The target parallel efficiency (Ep)T is calculated by the following
expression, which expresses linear extrapolation.

$$(E_p)_T = \max_i(\tau_i)\times\frac{E_p(p)}{(\tau)_T}$$

Then, it is judged whether the target parallel efficiency (Ep)T
is not higher than the limit parallel efficiency (Ep)max (step S95). If
the target processing time (τ)T could be set without any limitation, an
unrealizable target parallel efficiency (Ep)T could be set. Therefore,
it is judged at this step whether the target processing time (τ)T is
suitable. In case the target parallel efficiency (Ep)T exceeds the
limit parallel efficiency (Ep)max, it becomes necessary to set the
target processing time (τ)T or the limit parallel efficiency (Ep)max
again, and the procedure returns to the step S91.
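
The calculation of step S93 and the check of step S95 can be sketched as follows. The function name is a hypothetical one chosen for illustration; the numerical values (τ(p) = 37, Ep(4) = 0.4443, (τ)T = 28, (Ep)max = 0.6) are those quoted in the specific example given later in this description.

```python
def target_parallel_efficiency(tau_measured, ep_measured, tau_target):
    """Step S93: (Ep)T = max(tau_i) * Ep(p) / (tau)T (linear extrapolation)."""
    if tau_target <= 0:
        raise ValueError("target processing time must be positive")
    return tau_measured * ep_measured / tau_target


ep_limit = 0.6                                    # (Ep)max set by the programmer
ep_t = target_parallel_efficiency(37, 0.4443, 28)

# Step S95: the target processing time is suitable only if (Ep)T <= (Ep)max.
suitable = ep_t <= ep_limit
print(f"(Ep)T = {ep_t:.4f}, suitable = {suitable}")
```

A target of, say, (τ)T = 20 with the same measurements would give (Ep)T above 0.6 and would send the procedure back to the step S91.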

On the other hand, in the case where the target parallel
efficiency (Ep)T is not higher than the limit parallel efficiency
(Ep)max, it is judged whether the processing time τ(p) measured this
time is not higher than the target processing time (τ)T (step S97).
Incidentally, at the first execution of the step S97, the judgment is
always No. In case the processing time measured this time is not higher
than the target processing time (τ)T, a message indicating that the
object is achieved, and data such as the achieved parallel efficiency
and the processing time τ(p), are outputted to the output device 110
such as the display device (step S99). On the other hand, in the case
where the processing time measured this time exceeds the target
processing time (τ)T, it is judged whether the counter value ic is not
less than the repeat count icmax (step S101). In case the counter value
ic is the repeat count icmax or more, a message indicating that the
object cannot be achieved, and data such as the achieved parallel
efficiency and the processing time τ(p), are outputted to the output
device 110 such as the display device (step S103).

In case the counter value ic is less than the repeat count icmax,
the counter value ic is incremented by one (step S105). Then, tuning is
performed on the parallel performance impediment factors, such as the
redundancy processing, load balance, communication processing, or I/O
(step S107). There are cases where the program is not rewritten but is
tuned by using a tool, a compiler, a runtime library, or the like.
Since a programmer may perform this process, it is indicated here by a
dotted-line block. After the tuning, the program is again processed in
parallel in the parallel computer system 200, and at the same time, a
measurement processing of the processing times and the like is executed
by the measurement unit 201, and the results are stored in the storage
device (step S109). Since the step S109 is also not a processing of the
parallel performance analyzer 100, it is expressed by a block drawn
with an alternate long and short dash line. Thereafter, the data
acquisition unit 10 acquires the data of the processing time and the
like from the parallel computer system 200, and stores it in the log
data storage 30. Then, the parallel performance evaluation indexes
including the parallel efficiency are computed by the load balance
contribution ratio calculator 11, the virtual parallelization ratio
calculator 12, the parallel performance impediment factor contribution
ratio calculator 13, the parallel efficiency calculator 14, and the
like, and are stored in the log data storage 30 (step S111). Then, the
procedure returns to the step S97.

As stated above, since the tuning operation is executed so as to
achieve the target processing time (τ)T within a predetermined number
of tuning iterations, the programmer can also work effectively.

A specific example will be described on the basis of the processing
times shown in Fig. 23. Here, since τ(p) = 37 and Ep(4) = 0.4443, if
(Ep)max = 0.6 and (τ)T = 28, then (Ep)T = 0.5871. Accordingly, the
procedure proceeds from the step S95 to the step S97. Since this is the
first processing, the procedure proceeds from the step S97 to step S107
through steps S101 and S105. As the first tuning, it is assumed that
the communication time χC is reduced to 1/2. By using the result, the
parallel performance evaluation indexes including the parallel
efficiency are calculated at step S111. Then, the result shown in
Fig. 32 is obtained. Incidentally, Fig. 32 shows a comparison including
the processing time max(τi).

The calculation for the case where the communication time χC is
reduced to 1/2 by the tuning is as follows. The processing times are
assumed to be χ1,C(4) = 10/2 = 5, χ2,C(4) = 11/2 = 5.5, χ3,C(4)
= 12/2 = 6, and χ4,C(4) = 13/2 = 6.5. In addition, from the expression
(12-1), a calculation is carried out as follows:

$$\chi_{1,RED}(1) = \frac{1}{p}\sum_{i=1}^{p}\chi_{i,RED}(p) = \frac{1}{4}(8 + 9 + 7 + 7) = 7.75$$

(1) Load balance contribution ratio Rb(p) (expression (5))

$$R_b(p) = \frac{\sum_{i=1}^{p}\tau_i(p)}{\tau(p)\cdot p} = \frac{\sum_{i=1}^{p}\bigl(\gamma_i(p)+\chi_{i,RED}(p)+\chi_{i,C}(p)+\chi_{i,others}(p)\bigr)}{\max_i(\tau_i(p))\cdot p}$$

$$= \frac{(15+8+1)+5+(14+9+1)+5.5+(13+7+1)+6+(16+7+1)+6.5}{(16+7+13/2+1)\cdot 4} = \frac{29+29.5+27+30.5}{30.5\cdot 4} = \frac{116}{122} = 0.9508$$

(2) Virtual parallelization ratio (expression (6-1))

$$R_p(p) = \frac{\sum_{i=1}^{p}\gamma_i(p)}{\tau(1)} = \frac{15+14+13+16}{(15+14+13+16)+7.75} = \frac{58}{58+7.75} = 0.8821$$

(3) Parallel performance impediment factor contribution ratio
(expression (7))

$$R_{RED}(p) = \frac{\sum_{i=1}^{4}\chi_{i,RED}(p)}{\sum_{i=1}^{4}\tau_i(p)} = \frac{8+9+7+7}{116} = 0.2672$$

$$R_C(p) = \frac{\sum_{i=1}^{4}\chi_{i,C}(p)}{\sum_{i=1}^{4}\tau_i(p)} = \frac{5+5.5+6+6.5}{116} = 0.1983$$

$$R_{others}(p) = \frac{\sum_{i=1}^{4}\chi_{i,others}(p)}{\sum_{i=1}^{4}\tau_i(p)} = \frac{1+1+1+1}{116} = 0.0345$$

(4-1) Parallel efficiency (expression (4-4))

$$E_p(p) = R_b(p)\cdot\frac{1}{R_p(p)}\cdot\bigl(1 - R_{RED}(p) - R_C(p) - R_{others}(p)\bigr) = 0.9508\cdot\frac{1}{0.8821}\cdot(1 - 0.2672 - 0.1983 - 0.0345) = 0.5389$$

(4-2) Parallel efficiency (expression (9-2))

$$E_p(4) = R_b(4)\cdot\frac{1}{R_p(4)}\cdot\frac{\sum_{i=1}^{4}\gamma_i(4)}{\sum_{i=1}^{4}\tau_i(4)} = 0.9508\cdot\frac{1}{0.8821}\cdot\frac{58}{116} = 0.5389$$

Since the processing time max(τi) (= τ(p)) obtained by the first
tuning is 30.5 and the target processing time (τ)T = 28 has not been
achieved, it is necessary to perform further tuning.
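
The effect of this first tuning can be checked with the following Python sketch, which computes the indexes of expressions (5), (6-1), (7), and (4-4) before and after the communication times are halved. The function name and the dictionary layout are illustrative assumptions; the input values are the Fig. 23 processing times quoted in the text.

```python
def evaluation_indexes(gamma, x_red, x_c, x_oth):
    """Return the parallel performance evaluation indexes for one run."""
    p = len(gamma)
    tau = [g + r + c + o for g, r, c, o in zip(gamma, x_red, x_c, x_oth)]
    total = sum(tau)
    r_b = total / (max(tau) * p)                        # expression (5)
    r_p = sum(gamma) / (sum(gamma) + sum(x_red) / p)    # expressions (6-1), (12-1)
    r_red = sum(x_red) / total                          # expression (7)
    r_c = sum(x_c) / total
    r_oth = sum(x_oth) / total
    e_p = r_b / r_p * (1.0 - r_red - r_c - r_oth)       # expression (4-4)
    return {"tau": max(tau), "Rb": r_b, "Rp": r_p, "RRED": r_red,
            "RC": r_c, "Rothers": r_oth, "Ep": e_p}


GAMMA, X_RED = [15, 14, 13, 16], [8, 9, 7, 7]
X_C, X_OTH = [10, 11, 12, 13], [1, 1, 1, 1]

before = evaluation_indexes(GAMMA, X_RED, X_C, X_OTH)
after = evaluation_indexes(GAMMA, X_RED, [c / 2 for c in X_C], X_OTH)
for name, idx in (("before", before), ("after", after)):
    print(name, {k: round(v, 4) for k, v in idx.items()})
```

Comparing `after["tau"]` with the target processing time (τ)T = 28 corresponds to the judgment of the step S97.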

Conventionally, evaluation of the parallel performance has been
performed on the basis of a processing time comparison, such as a time
change caused by a change of the number of processors, a processing
time comparison with another system, or a comparison between the
numbers of operations performed in a given time. This requires two or
more time measurements and increases the program development time. In
addition, in the relative parallel performance evaluation by such a
comparison, when the processing data is changed, it becomes necessary
to measure the comparison reference again. As stated above, it takes
time to perform the parallel performance evaluation, and consequently,
there occur cases where an application program which exhibits parallel
performance only under a certain condition is developed. By executing
the aforementioned processing, the parallel performance evaluation
using the parallel efficiency can be made by one measurement, and it
becomes possible to greatly shorten the performance evaluation time in
the development of a parallel application program. As a result,
development of a parallel application program in which sufficient
consideration is given to the parallel performance becomes practical.

In addition, conventionally, in the performance improvement
operation by tuning of an application program, since the achievement
object is unclear, it is not easy to estimate the working time. It is
also unclear when the operation should be ended, and consequently,
there occur cases where it takes a lot of working time. Further, there
are also cases where not tuning but redevelopment of the application
program is required. By executing the aforementioned processing, the
object of the parallel efficiency improvement by tuning of an
application program is clearly determined, and the working time can
also be estimated through the repeat count of tuning and the like.

Further, conventionally, tuning of an application program is
performed in such a form that a procedure (part of the application
program) having a long processing time is found by time measurement or
the like, the parallel performance impediment factor causing the
problem is found in the procedure by comparison between processing
times, and the processing time is decreased. The processing described
above makes a performance evaluation of the load balance possible for
the first time in such tuning of a part of the application program.
F. Algorithm selection processing 

Conventionally, a processing time has been used for performance
comparison between programs in which the algorithm used for a part of a
parallel application program is changed; however, it has been
impossible to judge whether a decrease of the processing time is due to
an effect of the parallel processing or to a difference in function
(for example, a decrease of the number of operations). As a result, a
waste of resources, in which many processors are applied to an
algorithm having a short processing time but poor scalability, is
overlooked. In this embodiment, an algorithm having a superior parallel
efficiency is selected, and the operational efficiency of the whole
system is improved. Here, first, an example comparing an algorithm
unsuited for parallel processing with an algorithm suited for parallel
processing will be described.
[Algorithm unsuited for parallel processing] 



For example, a case where the processing times shown in Fig. 33 are
measured will be described. Incidentally, χ1,C(1) = 0. In addition,
from the expression (12-1), a calculation is carried out as follows:



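$$\chi_{1,RED}(1) = \frac{1}{p}\sum_{i=1}^{p}\chi_{i,RED}(p) = \frac{1}{4}(50 + 50 + 50 + 50) = 50$$

(1) Load balance contribution ratio (expression (5))

$$R_b(p) = \frac{\sum_{i=1}^{p}\tau_i(p)}{\tau(p)\cdot p} = \frac{4\cdot(50+50+10)}{(50+50+10)\cdot 4} = 1$$

(2) Virtual parallelization ratio (expression (6-1))

$$R_p(p) = \frac{\sum_{i=1}^{p}\gamma_i(p)}{\tau(1)} = \frac{50+50+50+50}{(50+50+50+50)+50} = \frac{200}{200+50} = 0.8000$$

(3) Acceleration ratio (expression (6-2))

$$A_p(p) = \frac{1}{1 - R_p(p)} = \frac{1}{1 - 0.8000} = 5.000$$

(4) Parallel performance impediment factor contribution ratio
(expression (7))

$$R_{RED}(p) = \frac{\sum_{i=1}^{4}\chi_{i,RED}(p)}{\sum_{i=1}^{4}\tau_i(p)} = \frac{50+50+50+50}{440} = 0.4545$$

$$R_C(p) = \frac{\sum_{i=1}^{4}\chi_{i,C}(p)}{\sum_{i=1}^{4}\tau_i(p)} = \frac{10+10+10+10}{440} = 0.09091$$

(4-1) Parallel efficiency (expression (4-4))

$$E_p(p) = R_b(p)\cdot\frac{1}{R_p(p)}\cdot\bigl(1 - R_{RED}(p) - R_C(p)\bigr) = 1.000\cdot\frac{1}{0.8000}\cdot(1 - 0.4545 - 0.09091) = 0.5682$$

(4-2) Parallel efficiency (expression (9-1))

$$E_p(4) = R_b(4)\cdot\frac{1}{R_p(4)}\cdot\frac{\sum_{i=1}^{4}\gamma_i(4)}{\sum_{i=1}^{4}\tau_i(4)} = 1.000\cdot\frac{1}{0.8000}\cdot\frac{200}{440} = 0.5682$$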

[Algorithm suited for parallel processing] 

A case where the processing times shown in Fig. 34 are measured
will be described. Incidentally, χ1,C(1) = 0. In addition, from the
expression (12-1), a calculation is carried out as follows:



$$\chi_{1,RED}(1) = \frac{1}{p}\sum_{i=1}^{p}\chi_{i,RED}(p) = 0$$

(1) Load balance contribution ratio (expression (5))

$$R_b(p) = \frac{\sum_{i=1}^{p}\tau_i(p)}{\tau(p)\cdot p} = \frac{4\cdot(110+10)}{(110+10)\cdot 4} = 1$$

(2) Virtual parallelization ratio (expression (6-1))

$$R_p(p) = \frac{\sum_{i=1}^{p}\gamma_i(p)}{\tau(1)} = \frac{110+110+110+110}{110+110+110+110} = 1.000$$

(3) Acceleration ratio (expression (6-2))

$$A_p(p) = \frac{1}{1 - R_p(p)} = \frac{1}{1 - 1} = \infty$$

(4) Parallel performance impediment factor contribution ratio
(expression (7))

$$R_{RED}(p) = \frac{\sum_{i=1}^{4}\chi_{i,RED}(p)}{\sum_{i=1}^{4}\tau_i(p)} = \frac{0+0+0+0}{480} = 0.0000$$

$$R_C(p) = \frac{\sum_{i=1}^{4}\chi_{i,C}(p)}{\sum_{i=1}^{4}\tau_i(p)} = \frac{10+10+10+10}{480} = 0.08333$$

(4-1) Parallel efficiency (expression (4-4))

$$E_p(p) = R_b(p)\cdot\frac{1}{R_p(p)}\cdot\bigl(1 - R_{RED}(p) - R_C(p)\bigr) = 1.000\cdot\frac{1}{1.000}\cdot(1 - 0.0000 - 0.08333) = 0.9167$$

(4-2) Parallel efficiency (expression (9-1))

$$E_p(4) = R_b(4)\cdot\frac{1}{R_p(4)}\cdot\frac{\sum_{i=1}^{4}\gamma_i(4)}{\sum_{i=1}^{4}\tau_i(4)} = 1.000\cdot\frac{1}{1.000}\cdot\frac{440}{480} = 0.9167$$

The above processing results are collected as shown in Fig. 35.
Let the algorithm unsuited for the parallel processing be numbered
j = 1, and the algorithm suited for the parallel processing be numbered
j = 2. At j = 1, the acceleration ratio is Ap = 5.000 and is finite, and
it is understood that even if processors are added, the acceleration is
limited to a factor of five. On the other hand, at j = 2, the
acceleration ratio is Ap = ∞, and there is a possibility that the larger
the number of added processors is, the shorter the processing time
becomes. Incidentally, the processing time τ is 110 at j = 1, which is
shorter than 120 at j = 2, and accordingly, there have hitherto been
cases where the algorithm of j = 1, which is not suited for parallel
processing, is selected.

Then, in this embodiment, the processing shown in Fig. 36 is executed in the algorithm selection processor 26. First, a setting input of a target processing time $(\tau)_T$ by a programmer is accepted (step S121). Then, as an initial setting, the algorithm number j is set to 1, and the optimum algorithm number $j_T$ is set to 1 (step S123). Besides, in the case of j = 1, the number $(p)_1$ of processors necessary for achieving the target processing time $(\tau)_T$ is calculated by linear extrapolation (step S125). That is, $(p)_1 = (\tau)_1 / (\tau)_T / (E_p)_1 + (p)_1$ is calculated by using the processing time of the algorithm number j = 1 stored in the log data storage 30, and is stored in the storage device (step S125). Besides, with respect to the optimum algorithm, the number of required processors is set as $(p)_T = \mathrm{INT}((p)_1 + 0.99)$. Further, $(p)_{\min} = (p)_1$ is set.

Next, j is incremented by one (step S127). Then, the number $(p)_j$ of processors in the case of j is calculated by the following expression, and is stored in the storage device (step S129).

$$(p)_j = (\tau)_j / (\tau)_T / (E_p)_j + (p)_j$$

Then, it is confirmed whether $(p)_j < (p)_{\min}$ and $(A_p)_j > (p)_j$ (step S131). That is, it is confirmed whether $(p)_j$ is the minimum and is less than the acceleration ratio $(A_p)_j$ of the algorithm, that is, whether it is realizable. At the steps S125 and S129, since $(p)_j$ is simply calculated by the linear extrapolation, whether it is realizable is checked here. In the case of $(p)_j < (p)_{\min}$ and $(A_p)_j > (p)_j$, the algorithm number j is set to $j_T$. That is, $j_T = j$. Besides, $(p)_T = \mathrm{INT}((p)_j + 0.99)$ is set (step S133).

After the step S131 or the step S133, it is confirmed whether j is not less than the maximum algorithm number $j_{\max}$, that is, whether all algorithms have been processed (step S135). If $j \geq j_{\max}$, the algorithm number finally specified by $j_T$, the number $(p)_T$ of processors in that case, and other processing results (a set of j, $(p)_j$, $(A_p)_j$, $(\tau)_j$, etc.) are outputted to the output device 110 such as the display device (step S137). On the other hand, if $j < j_{\max}$, the procedure returns to the step S127.

By doing so, it is possible to specify an algorithm in which the number of processors is small within a realizable range and the target processing time can be achieved. Besides, in addition to the algorithm that is made optimum in this processing flow, it is possible to select an algorithm in which the number of processors is not largely different, and tuning can be easily made.

A specific description will be given using the two algorithms shown in Fig. 35. First, a target processing time $(\tau)_T$ is set. Next, by using $(E_p)_j$ and $(\tau)_j$ of the algorithms, the number $(p)_j$ of required processors is calculated by the linear extrapolation. In the processing flow shown in Fig. 36, in order to obtain $(p)_j$ by the linear extrapolation, Ap(4), in which consideration is given to only the redundancy processing, is introduced as an upper limit of the number of processors, and it is assumed that if Ap(4) > $(p)_j$, then $(p)_j$ can be applied. As a result, while the limit performance of the algorithm unsuited for the parallel processing is 5.000, $(p)_j$ is 7.872, and it is understood that $(p)_j$ cannot be applied for the algorithm unsuited for the parallel processing. On the other hand, since the limit performance of the algorithm suited for the parallel processing is ∞, there is a possibility that $(\tau)_T = 50$ can be achieved by 6.618 processors. Accordingly, the algorithm $j_T = 2$ suited for the parallel processing can be selected. A first aim in that case is $(p)_T = 7$, obtained by rounding up 6.618. Until now, there have been many cases where the algorithm having a short processing time and unsuited for the parallel processing is adopted by comparing τ = 110 with 120. However, according to the processing flow of Fig. 36, the algorithm that has a longer processing time at p = 4 but is suited for the parallel processing can be selected.
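
The selection flow of Fig. 36 can be sketched as follows. This is a minimal illustration, not the embodiment's implementation: the dictionary keys `tau`, `e_p`, `a_p` and `p` and the function names are hypothetical, the right-hand $(p)_j$ of the extrapolation is read as the processor count used in the measurement (4 here, which reproduces 7.872 and 6.618), and the update of $(p)_{\min}$ at step S133 is an assumption that the text does not state.

```python
import math

def required_processors(tau_j, tau_target, e_p, p_measured):
    """Linear extrapolation (p)_j = (tau)_j / (tau)_T / (E_p)_j + (p)_j."""
    return tau_j / tau_target / e_p + p_measured

def select_algorithm(algorithms, tau_target):
    """Follow steps S123 to S137 as described in the text.

    `algorithms` is a list of dicts with the hypothetical keys
    'tau', 'e_p', 'a_p' and 'p' (measured values of each algorithm).
    Returns (j_T, p_T, estimates) with 1-based algorithm numbers.
    """
    estimates = []
    # Steps S123/S125: algorithm 1 is the initial candidate.
    first = algorithms[0]
    p_est = required_processors(first["tau"], tau_target, first["e_p"], first["p"])
    estimates.append(p_est)
    j_t, p_t, p_min = 1, int(p_est + 0.99), p_est
    # Steps S127 to S135: examine the remaining algorithms.
    for j, alg in enumerate(algorithms[1:], start=2):
        p_est = required_processors(alg["tau"], tau_target, alg["e_p"], alg["p"])
        estimates.append(p_est)
        # Step S131: fewer processors than the best so far, and realizable.
        if p_est < p_min and alg["a_p"] > p_est:
            j_t, p_t = j, int(p_est + 0.99)   # step S133
            p_min = p_est  # assumption: this update is not stated in the text
    return j_t, p_t, estimates

fig35 = [
    {"tau": 110, "e_p": 0.5682, "a_p": 5.000, "p": 4},     # j = 1, unsuited
    {"tau": 120, "e_p": 0.9167, "a_p": math.inf, "p": 4},  # j = 2, suited
]
j_t, p_t, estimates = select_algorithm(fig35, tau_target=50)
print(j_t, p_t, [round(x, 3) for x in estimates])
```

As in the flow described above, the first algorithm is taken as the initial candidate without a realizability check; only the later algorithms are tested against their acceleration ratio.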

G. Parallel performance evaluation processing

In this embodiment, it is possible to prepare log data of the parallel performance evaluation indexes of all processings in practical use. If a certain specific processing in this log data is made a target, it becomes possible to obtain specifications (CPU performance, communication performance, I/O performance, performance of runtime library, etc.) necessary for a dedicated parallel computer system. If the processings by all applications are made targets, it also becomes possible to prepare specifications necessary for a general-purpose parallel computer system based on the log.



For example, in order to improve the processing performances of the processing numbers 1 to 4 shown in Fig. 31, it is understood that it is proper to raise the communication performance, or to improve the CPU performance and the communication performance while keeping the ratio of the two. The parallel performance evaluation processor 27 constitutes a table as shown in Fig. 37 from, for example, data stored in the log data storage 30, and outputs it to the output device 110 such as the display device. Besides, a processing may be performed such that, among the parallel performance impediment factors, one exhibiting a relatively high value in any of the processings is highlighted. Besides, in order to improve the performance of the processing number 5, it is understood that what is required is not performance improvement by replacement of the system, but tuning of the application program. This is because only in the processing number 5 does the parallel performance impediment factor contribution ratio by the redundancy processing exhibit a large value. The parallel performance evaluation processor 27 may extract such a characteristic processing as well, and carry out, for example, a highlight display.

As a method for determining the communication performance, it is appropriate that, for example, the processing described in the system replacement data processing is executed. That is, every magnification other than the magnification of the communication performance is fixed to 1, and the processing is executed until the target Ep(4) is cleared.

Incidentally, if the communication performance is improved while attention is paid to the patterns of the processing numbers 1 to 4, in which the number of processings is large, this system becomes a general-purpose parallel computer system. On the other hand, if a mechanism to reduce the redundancy processing is introduced into a computer system while attention is paid to the pattern of the processing number 5, in which the number of processings is small, it becomes a dedicated parallel computer system. Besides, if tuning of the application program of the processing number 5 is performed and the redundancy processing is reduced, a general-purpose parallel computer system for the processing numbers 1 to 5 is obtained by only improving the communication performance.

Conventionally, since the parallel performance of a parallel computer system changes greatly according to the features of the parallel processing of an application program, it has not been easy to develop a parallel computer system. To overcome this, a method has often been used in which an application program is specified, its parallel performance is analyzed, and a parallel computer system suited for it is developed. However, in this method, if the application program is changed, there is a fear that the developed system cannot exhibit any parallel performance. According to this embodiment, since it becomes possible to prepare log data of the parallel performance evaluation indexes of all processings in practical use, if a certain specific processing is made a target on the basis of the log data, it becomes possible to prepare specifications (CPU performance, communication performance, I/O performance, performance of runtime library, etc.) necessary for a dedicated parallel computer system. Besides, if all processings are made targets, it also becomes possible to prepare specifications necessary for a general-purpose parallel computer system based on the log.

Besides, conventionally, whether or not means for quantitatively grasping the parallel performance impediment factors of a parallel computer system is incorporated depends greatly on the system, and there are also systems which do not have any such means. In this embodiment, as indicated by the expression (7), since a factor can be arbitrarily added starting from a state where there is no parallel performance impediment factor, it is possible to raise the evaluation accuracy by adding a factor measurement function at the time of an upgrade of the system after sale.

Further, among conventional performance evaluation indexes, for example, flop/s, Mop/s, tpmC, etc., some can be applied and some cannot, depending on the kind of the application program. In this embodiment, since an index is expressed by a time ratio, it is effective for all application programs, and the performance evaluation can be suitably executed. Further, although some conventional parallel performance evaluation methods can be applied to only a specific parallel processing, this embodiment can be applied to all parallel processings.

According to the embodiment of the invention described above, with respect to the parallel efficiency expressing the performance of the parallel processing, the ratio at which it is lowered can be expressed by the parallel performance evaluation indexes, that is, the load balance contribution ratio, the virtual parallelization ratio and the parallel performance impediment factor. Since the load balance contribution ratio is added to the parallel performance evaluation indexes, it becomes possible to make a parallel performance evaluation of all parallel processings.

Besides, if the expression (8-2) is used, in the case where Rp(p) is approximately 1, since the estimated value τ(1) is not necessary for the calculation of the parallel efficiency, it becomes possible to make an accurate (in the sense that the estimated value τ(1) is not included) parallel performance evaluation for all parallel processings, including the parallel processing by the grid or the cluster in which τ(1) cannot be measured.

Further, if the expressions (9-1) and (9-2) are used, even in the case of Rp(p) < 1, by estimating τ(1) for the calculation of the parallel efficiency, it becomes possible to make a parallel performance evaluation for all parallel processings, including the parallel processing by the grid or the cluster in which τ(1) cannot be measured.

The parallel processing impediment factor peculiar to the target parallel computer system can be introduced at any time in the form of the expression (7), and a detailed performance evaluation can be easily made. Further, the contribution ratio of the parallel performance impediment factor can be grasped as a percentage with respect to the parallel efficiency, and it becomes possible to make an intuitive parallel performance evaluation.

Besides, since the contribution of the load balance becomes clear as a numerical value, namely a ratio with respect to the parallel efficiency, the contribution of the load balance to the parallel efficiency, which could not be estimated until now, can be specifically indicated.

Besides, not only is the parallel performance index calculated and presented, but also the number of processors performing efficient processing can be determined by using the parallel efficiency determined by the processing time measurement. Further, in view of the efficiency of parallel processing, it is possible to consider the increase or decrease of processors.

Further, it is possible to theoretically consider the introduction of a new parallel computer system different in performance specification. Besides, usage efficiency management in system operation can be made using the parallel performance evaluation index.

Although the embodiment of the invention has been described, the invention is not limited to this. For example, the functional block diagram of Fig. 19 is an example, and each functional block does not necessarily correspond to a program module. Besides, not all of the processor number optimizer 21, the processor add-on estimation processor 22, the system replacement data processor 23, the operational efficiency data processor 24, the tuning processor 25, the algorithm selection processor 26, and the parallel performance evaluation processor 27 need be provided; there is a case where all of them are provided, and there is also a case where none of them are provided. Further, there is also a case where they are provided in an arbitrary combination.

[Calculation Example] 

The above-described embodiment can be applied to all parallel processings (a grid in a homo-structure in which memories, networks and CPU performances are the same, or in a hetero-structure in which they are different from one another, a cluster or a distributed memory, SMP (Symmetric Multiprocessing), SMP + distributed memory, NUMA (Non-Uniform Memory Access), etc.). Hereinafter, calculation examples with respect to typical modes will be described.

(1) Grid in homo-structure, etc. ($\chi_{1,j}(1) = 0$)

In the case where a processing is performed by a grid or a cluster, communication occurs since a network is used to assign a processing to each processor and to collect the processing results; however, this does not occur when the processing is performed by one processor. Such a processing is a processing of $\chi_{1,j}(1) = 0$. Here, it is assumed that the parallel performance impediment factor is only communication, and the parallel performance of the processing of $\chi_{1,c}(1) = 0$ is evaluated. For example, a description will be given of a case where a measurement result of elapsed time as shown in Fig. 38 is obtained.

The following calculations are carried out from the expression (3).

$$\tau_1(4) = 5 + 25 + 15 + 15 + 10 = 70$$
$$\tau_2(4) = 10 + 20 + 10 + 10 + 10 = 60$$
$$\tau_3(4) = 15 + 15 + 10 + 15 + 10 = 65$$
$$\tau_4(4) = 10 + 10 + 10 + 25 + 10 = 65$$

The following calculations are respectively carried out from the expressions (5), (6-1), (6-2) and (7).

$$R_b(4) = \frac{70 + 60 + 65 + 65}{70 \times 4} = 0.9286$$
$$R_p(4) = \frac{40 + 30 + 30 + 35}{40 + 30 + 30 + 35} = 1.000$$
$$A_p(4) = \frac{1}{1 - 1.000} = \infty$$
$$R_c(4) = \frac{30 + 30 + 35 + 30}{260} = 0.4808$$

With respect to the parallel efficiency, the following calculations are respectively carried out from the expressions (4-4), (4-5), (8-2), (9-1), and (9-2) in sequence.

$$E_p(4) = 0.9286 \times \frac{1}{1.000} \times (1 - 0.4808) = 0.4821$$
$$E_p(4) = 0.9286 \times \frac{1}{1 - 1/\infty} \times (1 - 0.4808) = 0.4821$$
$$E_p(4) = 0.9286 \times (1 - 0.4808) = 0.4821$$
$$E_p(4) = 0.9286 \times \frac{1}{1.000} \times \frac{40 + 30 + 30 + 35}{70 + 60 + 65 + 65} = 0.4822$$
$$E_p(4) = \frac{(40 + 30 + 30 + 35) + 0}{70 \times 4} = 0.4821$$

The above calculated parallel performance evaluation indexes are collected as shown in Fig. 39. Since Ap(p) = ∞, the possibility of performance improvement is infinite when the parallel processing is performed by p = ∞ processors; however, the actual performance improvement Ep(4)·p at the time when the processor number p = 4 is applied is 1.928. The reason is that the parallel efficiency Ep(4) becomes 93% (Rb(4) = 0.9286) by the load balance contribution ratio, and is further lowered by 48% (Rc(4) = 0.4808) by communication.
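
The calculation of example (1) can be sketched in a few lines. This is an illustrative sketch under stated assumptions rather than the embodiment's code: the function name `evaluate` and its parameters are hypothetical, and τ(p) is taken to be the largest per-processor time, which agrees with the denominator 70 × 4 used above.

```python
import math

def evaluate(tau, chi_p, factors, chi1=0.0):
    """Parallel performance indexes from per-processor times.

    tau     : total time tau_i(p) of each processor
    chi_p   : parallel-processing time chi_{i,p}(p) of each processor
    factors : {name: [chi_{i,j}(p) per processor]} for each impediment factor
    chi1    : sum over factors of the estimated one-processor time chi_{1,j}(1)
    """
    p = len(tau)
    tau_max = max(tau)  # assumption: tau(p) is the slowest processor's time
    r_b = sum(tau) / (tau_max * p)                            # expression (5)
    r_p = sum(chi_p) / (sum(chi_p) + chi1)                    # expression (6-1)
    a_p = math.inf if r_p == 1 else 1 / (1 - r_p)             # expression (6-2)
    r_j = {k: sum(v) / sum(tau) for k, v in factors.items()}  # expression (7)
    e_44 = r_b / r_p * (1 - sum(r_j.values()))                # expression (4-4)
    e_92 = (sum(chi_p) + chi1) / (tau_max * p)                # expression (9-2)
    return {"R_b": r_b, "R_p": r_p, "A_p": a_p, "R_j": r_j,
            "E_p(4-4)": e_44, "E_p(9-2)": e_92}

# Values of example (1): only communication, chi_{1,c}(1) = 0.
result = evaluate(tau=[70, 60, 65, 65],
                  chi_p=[40, 30, 30, 35],
                  factors={"c": [30, 30, 35, 30]})
print(result)
```

Because the sketch works with unrounded intermediate values, the last digit of a result may differ slightly from a figure computed from four-digit intermediate values.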

(2) Grid in homo-structure, etc. ($\chi_{1,red}(1) \neq 0$)

In numerical calculation, a parallel processing is often performed in the so-called data parallel form, in which an application program is copied to all processors, and the indexes etc. of a loop processing are allocated to the respective processors to share the processing. In the data parallel, for example, a processing which cannot be processed in parallel remains between loops. When this processing is performed by all processors, since the contents of the processing are the same, this is called a redundancy processing. The feature of the redundancy processing is that, since it is a necessary processing, $\chi_{1,red}(1) \neq 0$ is always established even in the case where the processing is not a parallel processing. Here, it is assumed that the parallel performance impediment factor is only the redundancy processing, and the parallel performance of a processing having $\chi_{1,red}(1) \neq 0$ will be evaluated. For example, a description will be given of a case where a measurement result of elapsed time as shown in Fig. 40 is obtained.

The following calculations are carried out from the expression (3).

$$\tau_1(4) = 8 + 35 + 10 + 20 + 5 = 78$$
$$\tau_2(4) = 10 + 33 + 11 + 22 + 7 = 83$$
$$\tau_3(4) = 7 + 37 + 10 + 19 + 4 = 77$$
$$\tau_4(4) = 11 + 30 + 9 + 18 + 6 = 74$$

The following calculations are respectively carried out from the expressions (5), (12-1), (6-1), (6-2) and (7).

$$R_b(4) = \frac{78 + 83 + 77 + 74}{83 \times 4} = 0.9398$$
$$\chi_{1,red}(1) = \frac{1}{4} \cdot (23 + 28 + 21 + 26) = 24.5$$
$$R_p(4) = \frac{55 + 55 + 56 + 48}{(55 + 55 + 56 + 48) + 24.5} = 0.8973$$
$$A_p(4) = \frac{1}{1 - 0.8973} = 9.737$$
$$R_{red}(4) = \frac{23 + 28 + 21 + 26}{78 + 83 + 77 + 74} = 0.3141$$

With respect to the parallel efficiency, the following calculations are respectively carried out from the expressions (4-4), (4-5), (9-1) and (9-2) in sequence.

$$E_p(4) = 0.9398 \times \frac{1}{0.8973} \times (1 - 0.3141) = 0.7184$$
$$E_p(4) = 0.9398 \times \frac{1}{1 - 1/9.737} \times (1 - 0.3141) = 0.7184$$
$$E_p(4) = 0.9398 \times \frac{1}{0.8973} \times \frac{55 + 55 + 56 + 48}{78 + 83 + 77 + 74} = 0.7184$$
$$E_p(4) = \frac{(55 + 55 + 56 + 48) + 24.5}{83 \times 4} = 0.7184$$

The above calculated parallel performance evaluation indexes are collected as shown in Fig. 41. Here, since Ap(p) = 9.737, a parallel processing of p > 9 is meaningless. The actual performance improvement Ep·p at the time when the number of processors p = 4 is applied is 2.874. The reason is that the parallel efficiency Ep becomes 94% (Rb(4) = 0.9398) by the load balance contribution ratio, and is further lowered by 31% (Rred(4) = 0.3141) by the redundancy processing.
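
As a cross-check of example (2), the following sketch evaluates the four forms of the parallel efficiency from the same measurements and shows that they agree. The variable names are illustrative only, and τ(p) is again assumed to be the largest per-processor time (83 here). With unrounded arithmetic the acceleration ratio comes out at about 9.735; the value 9.737 in the text follows from the four-digit value Rp(4) = 0.8973.

```python
# Per-processor times of example (2); variable names are illustrative.
tau = [78, 83, 77, 74]        # tau_i(4)
chi_red = [23, 28, 21, 26]    # redundancy processing time of each processor
chi_p = [t - r for t, r in zip(tau, chi_red)]  # remaining parallel part

p = len(tau)
chi1_red = sum(chi_red) / p                      # expression (12-1)
r_b = sum(tau) / (max(tau) * p)                  # expression (5)
r_p = sum(chi_p) / (sum(chi_p) + chi1_red)       # expression (6-1)
a_p = 1 / (1 - r_p)                              # expression (6-2)
r_red = sum(chi_red) / sum(tau)                  # expression (7)

e_44 = r_b / r_p * (1 - r_red)                   # expression (4-4)
e_45 = r_b / (1 - 1 / a_p) * (1 - r_red)         # expression (4-5)
e_91 = r_b / r_p * sum(chi_p) / sum(tau)         # expression (9-1)
e_92 = (sum(chi_p) + chi1_red) / (max(tau) * p)  # expression (9-2)

for name, value in [("4-4", e_44), ("4-5", e_45), ("9-1", e_91), ("9-2", e_92)]:
    print(f"E_p via ({name}): {value:.4f}")
```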

(3) Grid in homo-structure, etc. ($\chi_{1,j}(1) \neq 0$: except redundancy processing)

For example, the processing time of a communication library is constituted by network communication and an operation. This operation time is treated as $\chi_{1,c}(1)$. Here, it is assumed that the parallel performance impediment factor is only communication, and the parallel performance of a processing of $\chi_{1,c}(1) \neq 0$ will be evaluated. For example, a description will be given of a case where a measurement result of elapsed time as shown in Fig. 42 is obtained.

The following calculations are carried out from the expression (3).

$$\tau_1(4) = 2 + 5 + 25 + 2 + 15 + 15 + 2 + 10 = 76$$
$$\tau_2(4) = 3 + 10 + 20 + 2 + 10 + 10 + 2 + 10 = 67$$
$$\tau_3(4) = 2 + 15 + 15 + 2 + 10 + 15 + 2 + 10 = 71$$
$$\tau_4(4) = 2 + 10 + 10 + 2 + 10 + 25 + 2 + 10 = 71$$

The following calculations are respectively carried out from the expressions (5), (12-1), (6-1), (6-2) and (7).

$$R_b(4) = \frac{76 + 67 + 71 + 71}{76 \times 4} = 0.9375$$
$$\chi_{1,c}(1) = \frac{1}{4} \cdot (6 + 7 + 6 + 6) = 6.25$$
$$R_p(4) = \frac{40 + 30 + 30 + 35}{(40 + 30 + 30 + 35) + 6.25} = 0.9557$$
$$A_p(4) = \frac{1}{1 - 0.9557} = 22.6$$
$$R_c(4) = \frac{\sum_{i=1}^{4} \chi_{i,c}(4)}{285} = 0.5263$$

With respect to the parallel efficiency, the following calculations are respectively carried out from the expressions (4-4), (4-5), (9-1) and (9-2) in sequence.

$$E_p(4) = 0.9375 \times \frac{1}{0.9557} \times (1 - 0.5263) = 0.4647$$
$$E_p(4) = 0.9375 \times \frac{1}{1 - 1/22.6} \times (1 - 0.5263) = 0.4647$$
$$E_p(4) = 0.9375 \times \frac{1}{0.9557} \times \frac{40 + 30 + 30 + 35}{76 + 67 + 71 + 71} = 0.4647$$
$$E_p(4) = \frac{(40 + 30 + 30 + 35) + 6.25}{76 \times 4} = 0.4646$$



The above calculated parallel performance evaluation indexes are collected as shown in Fig. 43. Here, since Ap(p) = 22.57, the processing should be performed at p < 22. The actual performance improvement Ep·p at the time when the processor number p = 4 is applied is 1.859. The reason is that the parallel efficiency Ep is lowered by 53% (Rc = 0.5263) by communication. The load balance contribution ratio is 94% (Rb(4) = 0.9375), and the load balance is not a main factor impeding the parallel performance in this case. This is different from the example (1) in that Ap(p) comes to have a finite value because of $\chi_{1,c}(1) \neq 0$.



There is a case where an operation included in the communication processing is changed with the increase of processors. When this is regarded as $\chi_{i,c}(p)$ and is taken into the parallel performance evaluation, it becomes possible to integrate the number of operations, which varies according to the number of processors, into the evaluation.

(4) Grid in homo-structure, etc. (a case where there is an idling: also called wait)

In the case where a specific processor performs a processing and the other processors use the result, the other processors cannot start a next processing until that processing is ended. For example, this applies to a case where only the specific processor can access a database (DB). In Fig. 44, a processor #1 performs this processing (y*). The other processors are in waiting states during the DB processing. It is possible to evaluate the parallel performance in the case where there is an idling processing while a CPU is made to wait as stated above. It is assumed that the measurement result of elapsed time as shown in Fig. 44 is obtained.

The following calculations are carried out from the expression (3).

$$\tau_1(4) = 5 + 25 + 10 + 10 + 3 + 35 + 5 = 93$$
$$\tau_2(4) = 10 + 20 + 10 + 10 + 5 + 30 + 4 = 89$$
$$\tau_3(4) = 15 + 15 + 10 + 10 + 4 + 33 + 3 = 90$$
$$\tau_4(4) = 10 + 10 + 10 + 20 + 6 + 29 + 4 = 89$$

The following calculations are respectively carried out from the expressions (5), (12-1), (6-1), (6-2) and (7).

$$R_b(4) = \frac{93 + 89 + 90 + 89}{93 \times 4} = 0.9704$$
$$R_p(4) = \frac{70 + 50 + 48 + 39}{70 + 50 + 48 + 39} = 1.000$$
$$A_p(4) = \frac{1}{1 - 1.000} = \infty$$
$$R_c(4) = \frac{23 + 29 + 32 + 30}{361} = 0.3158$$
$$R_w(4) = \frac{0 + 10 + 10 + 20}{361} = 0.1108$$

With respect to the parallel efficiency, the following calculations are respectively carried out from the expressions (4-4), (4-5), (8-2), (9-1) and (9-2) in sequence.

$$E_p(4) = 0.9704 \times \frac{1}{1.000} \times (1 - 0.3158 - 0.1108) = 0.5564$$
$$E_p(4) = 0.9704 \times \frac{1}{1 - 1/\infty} \times (1 - 0.3158 - 0.1108) = 0.5564$$
$$E_p(4) = 0.9704 \times (1 - 0.3158 - 0.1108) = 0.5564$$
$$E_p(4) = 0.9704 \times \frac{1}{1.000} \times \frac{70 + 50 + 48 + 39}{361} = 0.5564$$
$$E_p(4) = \frac{(70 + 50 + 48 + 39) + 0}{93 \times 4} = 0.5565$$

The above calculated parallel performance evaluation indexes are collected as shown in Fig. 45. Since Ap(4) = ∞, the possibility of performance improvement at the time when the parallel processing is performed by p = ∞ processors is infinite; however, the actual performance improvement Ep·p at the time when p = 4 is applied is 2.226. The reason is that the parallel efficiency Ep is lowered by 32% (Rc(4) = 0.3158) by communication and by 11% (Rw(4) = 0.1108) by idling. The load balance contribution ratio is 97% (Rb(4) = 0.97043), and the load balance is not a main factor impeding the parallel performance in this case.

(5) Grid in homo-structure, etc. (a case where there is an idling since there is another processing)

In the case where a processing is performed in a grid or a cluster, each processor is seldom used only for the target processing, and in general, plural processings coexist on it. In that case, a waiting occurs by an interrupt of another processing. This is shown in Fig. 46. The parallel performance in the case where there is a waiting since another processing exists as stated above will be evaluated. It is assumed that a measurement result of elapsed time as shown in Fig. 46 is obtained.

The following calculations are carried out from the expression (3).

$$\tau_1(4) = 5 + 25 + 50 + 35 + 5 = 120$$
$$\tau_2(4) = 10 + 20 + 5 + 10 + 10 + 30 + 4 = 89$$
$$\tau_3(4) = 15 + 15 + 10 + 10 + 10 + 33 + 3 = 96$$
$$\tau_4(4) = 10 + 10 + 20 + 29 + 4 = 73$$

The following calculations are respectively carried out from the expressions (5), (12-1), (6-1), (6-2) and (7).

$$R_b(4) = \frac{120 + 89 + 96 + 73}{120 \times 4} = 0.7875$$
$$R_p(4) = \frac{60 + 60 + 58 + 39}{60 + 60 + 58 + 39} = 1.000$$
$$A_p(4) = \frac{1}{1 - 1.000} = \infty$$
$$R_c(4) = \frac{10 + 14 + 18 + 14}{378} = 0.1481$$
$$R_w(4) = \frac{50 + 15 + 20 + 20}{378} = 0.2778$$

With respect to the parallel efficiency, the following calculations are respectively carried out from the expressions (4-4), (4-5), (8-2), (9-1) and (9-2) in sequence.

$$E_p(4) = 0.7875 \times \frac{1}{1.000} \times (1 - 0.1481 - 0.2778) = 0.4521$$
$$E_p(4) = 0.7875 \times \frac{1}{1 - 1/\infty} \times (1 - 0.1481 - 0.2778) = 0.4521$$
$$E_p(4) = 0.7875 \times (1 - 0.1481 - 0.2778) = 0.4521$$
$$E_p(4) = 0.7875 \times \frac{1}{1.000} \times \frac{60 + 60 + 58 + 39}{378} = 0.4521$$
$$E_p(4) = \frac{(60 + 60 + 58 + 39) + 0}{120 \times 4} = 0.4521$$

The above calculated parallel performance evaluation indexes are collected as shown in Fig. 47. Since Ap(4) = ∞, the possibility of performance improvement when the parallel processing is performed by p = ∞ processors is infinite; however, the actual performance improvement Ep·p at the time when p = 4 is applied is 1.808. The reason is that the parallel efficiency Ep becomes 79% (Rb(4) = 0.7875) by the load balance contribution ratio, is lowered by 28% (Rw(4) = 0.2778) by the waiting of the time sharing, and is further lowered by 15% (Rc(4) = 0.1481) by the communication. Since Rw(4) occurs by another processing, Rb(4) becomes the load balance contribution ratio in which consideration is given to the whole system. In the case where another processing exists, it is necessary to pay attention to Rb(4) and Rw(4). Even if Rb(4) = 1, if Rw(4) is large, it means that a crowded system is used, and Ep comes to have a low value. When the grid or the cluster processing is developed, by selecting the system so that Rb(4) approaches 1 and Rw(4) approaches 0, it becomes possible to efficiently perform the parallel processing. This is understood for the first time by this embodiment.

Incidentally, as a method of distinguishing between a self-processing (target processing) and another processing (processing outside a target), there is a method in which a CPU time and an elapsed time are measured. In general, the CPU time is a time of only the self-processing, and the elapsed time is a time including the other processing. Accordingly, there is a case where the relation "waiting time for time sharing = elapsed time - CPU time" is established.
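
The relation above can be illustrated with the Python standard library, where `time.perf_counter()` gives an elapsed (wall-clock) time and `time.process_time()` gives the CPU time of the current process. This is only a sketch: the helper names `measure` and `sample_work` are hypothetical, and the `time.sleep` call merely stands in for time during which the process is not running on a CPU.

```python
import time

def measure(work):
    """Run work() and return (elapsed, cpu, waiting) in seconds."""
    t0, c0 = time.perf_counter(), time.process_time()
    work()
    elapsed = time.perf_counter() - t0
    cpu = time.process_time() - c0
    # waiting time = elapsed time - CPU time (the relation quoted above)
    return elapsed, cpu, elapsed - cpu

def sample_work():
    sum(i * i for i in range(200_000))  # CPU-bound part
    time.sleep(0.05)                    # stand-in for time lost while not running

elapsed, cpu, waiting = measure(sample_work)
print(f"elapsed={elapsed:.3f}s cpu={cpu:.3f}s waiting={waiting:.3f}s")
```

The difference also contains any other time in which the process is not executing, such as blocking I/O, which is consistent with the text's wording that the relation holds only in some cases.
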
(6) Grid in homo-structure, etc. (case of a data parallel processing)

The data parallel processing is a parallel processing in which the procedures of the respective processors are identical and the data are different; for example, data of 1000 items are divided into 250 items for each of four processors to perform the processing. With respect to a processing which cannot be processed in parallel, there is a case where the data of all processors are made the same, that is, a redundancy processing is performed, and there is a case where the processing is performed by a certain processor and the result is broadcast to all processors. Here, the parallel performances of both will be evaluated.
[Data parallel processing using redundancy processing] 

It is assumed that a measurement result of elapsed time as shown in Fig. 48 is obtained. Besides, it is assumed that $\chi_{1,c}(1) = 0$.

The following calculations are respectively carried out from the expressions (3), (5), (12-1), (6-1), (6-2) and (7).

$$\tau_1(4) = \tau_2(4) = \tau_3(4) = \tau_4(4) = 5 + 20 + 5 + 10 + 2 + 30 + 3 = 75$$
$$R_b(4) = \frac{75 + 75 + 75 + 75}{75 \times 4} = 1.000$$
$$\chi_{1,red}(1) = \frac{1}{4} \cdot (10 + 10 + 10 + 10) = 10.00$$
$$R_p(4) = \frac{50 + 50 + 50 + 50}{(50 + 50 + 50 + 50) + 10} = 0.9524$$
$$A_p(4) = \frac{1}{1 - 0.9524} = 21.01$$
$$R_{red}(4) = \frac{10 + 10 + 10 + 10}{300} = 0.1333$$
$$R_c(4) = \frac{15 + 15 + 15 + 15}{300} = 0.2000$$

With respect to the parallel efficiency, the following calculations are respectively carried out from the expressions (4-4), (4-5), (9-1), and (9-2) in sequence.

$$E_p(4) = 1.000 \times \frac{1}{0.9524} \times (1 - 0.1333 - 0.2000) = 0.7000$$
$$E_p(4) = 1.000 \times \frac{1}{1 - 1/21.01} \times (1 - 0.1333 - 0.2000) = 0.7000$$
$$E_p(4) = 1.000 \times \frac{1}{0.9524} \times \frac{50 + 50 + 50 + 50}{75 + 75 + 75 + 75} = 0.7000$$
$$E_p(4) = \frac{(50 + 50 + 50 + 50) + 10}{75 \times 4} = 0.7000$$

The above calculated parallel performance evaluation indexes are collected as shown in Fig. 49. Since Ap(4) = 21, the number of processors should be selected within a range of p < 21. The actual performance improvement Ep·p at the time when the number of processors p = 4 is applied is 2.800. The reason is that the parallel efficiency Ep is lowered by 20% (Rc = 0.2000) by communication and by 13% (Rred(4) = 0.1333) by redundancy processing.

[Data parallel processing in which a portion which cannot be processed in parallel is processed by a specific processor]

There is a case where a portion which cannot be processed in parallel is not redundancy-processed, but is processed by a specific processor. Fig. 50 shows a case where, instead of the redundancy processing of Fig. 48, the processing for the portion y* is performed by only the processor #1, and the result is broadcast to the respective processors. Naturally, during that time, the other processors wait for the result of the processor #1. Besides, here, although y* is treated as a parallel processing, if it is added to the parallel processing impediment factors as a sequential processing, a more detailed parallel performance evaluation can be made. However, for that, it becomes necessary to judge whether the processing of y* is a sequential processing or a parallel processing. It is assumed that a measurement result of elapsed time as shown in Fig. 50 is obtained. Besides, $\chi_{1,c}(1) = 0$ is assumed.

The following calculations are respectively carried out from the expressions (3), (5), (12-1), (6-1), (6-2) and (7).

$$\tau_1(4) = \tau_2(4) = \tau_3(4) = \tau_4(4) = 5 + 20 + 5 + 10 + 2 + 30 + 3 = 75$$
$$R_b(4) = \frac{75 + 75 + 75 + 75}{75 \times 4} = 1.000$$
$$R_p(4) = \frac{60 + 50 + 50 + 50}{(60 + 50 + 50 + 50) + 0} = 1.000$$
$$A_p(4) = \frac{1}{1 - 1} = \infty$$
$$R_c(4) = \frac{15 + 15 + 15 + 15}{300} = 0.2000$$
$$R_w(4) = \frac{0 + 10 + 10 + 10}{300} = 0.1000$$

With respect to the parallel efficiency, the following calculations are respectively carried out from the expressions (4-4), (4-5), (9-1), and (9-2) in sequence.

$$E_p(4) = 1.000 \times \frac{1}{1.000} \times (1 - 0.1000 - 0.2000) = 0.7000$$
$$E_p(4) = 1.000 \times \frac{1}{1 - 1/\infty} \times (1 - 0.1000 - 0.2000) = 0.7000$$
$$E_p(4) = 1.000 \times \frac{1}{1.000} \times \frac{60 + 50 + 50 + 50}{75 + 75 + 75 + 75} = 0.7000$$
$$E_p(4) = \frac{(60 + 50 + 50 + 50) + 0}{75 \times 4} = 0.7000$$

The above calculated parallel performance evaluation indexes are
collected as shown in Fig. 51. Although the performance improvement
at the time when the parallel processing is performed is infinite since
Ap(4) = ∞, an actual performance improvement Ep·p at the time when p =
4 is applied is 2.800. The reason is that the parallel efficiency Ep is
lowered by 20% (Rc(4) = 0.2000) by communication, and by 10% (Rw(4) =
0.1000) by waiting. The values of Rp(4), Ap(4), Rred(4), and Rw(4) are
different between Figs. 49 and 51. On the other hand, Rb(4) and Ep(4)
have the same values. In Fig. 51, since the portion which can not be
processed in parallel is evaluated as the parallel processing, Rp(4) = 1.
In addition, the redundancy processings of the respective processors
#2, #3, and #4 are changed to the waiting, and are substituted by the
parallel performance impediment factor Rw(4).
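
The index calculations for the Fig. 50 measurement can be reproduced with a short script. The following is a minimal sketch, not part of the original disclosure: the function name `evaluate`, its argument layout, and the forms of the expressions (inferred from the worked numbers above) are assumptions made for illustration.

```python
import math

def evaluate(tau, chi_p, factors, chi_seq=0.0):
    """Compute parallel performance indexes from one measurement.

    tau:     elapsed time per processor
    chi_p:   parallel processing time per processor
    factors: {name: per-processor times} for each impediment factor
    chi_seq: sequential processing time (0 in this example, as assumed in the text)
    """
    p = len(tau)
    total = sum(tau)
    r_b = total / (max(tau) * p)                 # load balance contribution ratio
    r_p = sum(chi_p) / (sum(chi_p) + chi_seq)    # parallelized ratio
    a_p = math.inf if r_p == 1.0 else 1.0 / (1.0 - r_p)  # acceleration ratio
    r = {name: sum(t) / total for name, t in factors.items()}
    e_p = r_b / r_p * (1.0 - sum(r.values()))    # parallel efficiency
    return {"Rb": r_b, "Rp": r_p, "Ap": a_p, "R": r, "Ep": e_p, "Ep*p": e_p * p}

# Per-processor values read from the worked example for Fig. 50.
fig50 = evaluate(
    tau=[75, 75, 75, 75],
    chi_p=[60, 50, 50, 50],
    factors={"c": [15, 15, 15, 15], "w": [0, 10, 10, 10]},
)
print(fig50)
```

With these inputs the sketch yields Ep(4) of about 0.7 and Ep·p of about 2.8, in line with the values quoted above.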
(7) Grid in homo-structure, etc. (case of control parallel processing)

In the control parallel processing, procedures of the respective
processors are normally different from each other. Thus, a parallel
processing in which the procedure times of the respective processors
are irregular often occurs. Here, the parallel performance of the
control parallel will be evaluated. It is assumed that a measurement
result of elapsed time as shown in Fig. 52 is obtained. In addition,
χi,c = 0 is assumed.

The following calculations are carried out from the expression (3).

$$\tau_1(4) = 53 + 15 + 3 + 12 + 3 + 20 = 106$$

$$\tau_2(4) = 2 + 30 + 5 + 10 + 4 + 2 + 15 + 3 = 71$$

$$\tau_3(4) = 4 + 20 + 7 + 20 + 2 + 30 + 3 = 86$$

$$\tau_4(4) = 6 + 20 + 5 + 20 + 2 + 30 + 3 = 86$$

The following calculations are carried out from the expressions (5),
(6-1), (6-2), and (7), respectively.

$$R_b(4) = \frac{106 + 71 + 86 + 86}{106 \times 4} = 0.8231$$

$$R_p(4) = \frac{73 + 55 + 70 + 70}{73 + 55 + 70 + 70} = 1.000$$

$$A_p(4) = \frac{1}{1 - 1.000} = \infty$$

$$R_{tc}(4) = \frac{0 + 2 + 4 + 6}{349} = 0.0344$$

$$R_c(4) = \frac{6 + 10 + 12 + 10}{349} = 0.1089$$

$$R_w(4) = \frac{27 + 4 + 0 + 0}{349} = 0.0888$$

With respect to the parallel efficiency, the following calculations
are carried out from the expressions (4-4), (4-5), (8-2), (9-1) and
(9-2) in sequence.

$$E_p(4) = 0.8231 \times \frac{1}{1.000} \times (1 - 0.0344 - 0.1089 - 0.0888) = 0.6321$$

$$E_p(4) = 0.8231 \times \frac{1}{1 - 1/\infty} \times (1 - 0.0344 - 0.1089 - 0.0888) = 0.6321$$

$$E_p(4) = 0.8231 \times (1 - 0.0344 - 0.1089 - 0.0888) = 0.6321$$

$$E_p(4) = 0.8231 \times \frac{1}{1.000} \times \frac{73 + 55 + 70 + 70}{106 + 71 + 86 + 86} = 0.6321$$

$$E_p(4) = \frac{(73 + 55 + 70 + 70) + 0}{106 \times 4} = 0.6321$$

The above computed parallel performance evaluation indexes are
collected as shown in Fig. 53. Although the performance improvement at
the time when the parallel processing is performed is infinite since
Ap(4) = ∞, an actual performance improvement Ep·p at the time when p =
4 is applied is 2.528. The reason is that the parallel efficiency Ep
becomes 82% (Rb(4) = 0.8231) by the load balance contribution ratio, and
is further lowered by 23% (Rtc(4) + Rc(4) + Rw(4) = 0.0344 + 0.1089 +
0.0888) in total by task generation, communication and waiting.
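
That the different expressions for the parallel efficiency give the same value can be checked numerically. The sketch below is illustrative only: the variable names are hypothetical, and the forms used for expressions (4-4), (9-1) and (9-2) are inferred from the worked numbers for Fig. 52 rather than quoted from their definitions.

```python
# Per-processor times read from the worked example (Fig. 52).
tau    = [106, 71, 86, 86]   # elapsed time
chi_p  = [73, 55, 70, 70]    # parallel processing time
chi_tc = [0, 2, 4, 6]        # task generation
chi_c  = [6, 10, 12, 10]     # communication
chi_w  = [27, 4, 0, 0]       # waiting
chi_seq = 0                  # sequential part, assumed 0 in the text

# Each processor's elapsed time is the sum of its components.
assert [sum(t) for t in zip(chi_p, chi_tc, chi_c, chi_w)] == tau

p = len(tau)
total = sum(tau)
r_b = total / (max(tau) * p)
r_p = sum(chi_p) / (sum(chi_p) + chi_seq)
r_sum = (sum(chi_tc) + sum(chi_c) + sum(chi_w)) / total

ep_44 = r_b / r_p * (1 - r_sum)                  # form used for (4-4)
ep_91 = r_b / r_p * sum(chi_p) / total           # form used for (9-1)
ep_92 = (sum(chi_p) + chi_seq) / (max(tau) * p)  # form used for (9-2)
print(round(ep_44, 4), round(ep_91, 4), round(ep_92, 4), round(ep_92 * p, 3))
```

Computed without intermediate rounding, the three forms agree to floating-point precision, which suggests that the small differences seen in hand calculations come from rounding the intermediate indexes.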

In order to improve the parallel performance, the parallel
performance indexes are compared, and room for improvement is considered
in descending order of influence on the lowering of the parallel
performance. In the case of Fig. 53, the order is Rb(4), Rc(4), Rw(4),
and Rtc(4). If Rb(4) = 1, then Ep(4)·p = 3.071 (= 2.528/0.8231).
Thus, for example, an attempt is made to change a processing schedule
so that the processing time of the processor #1 becomes the same as that
of another processor. The next room for improvement is a reduction of
Rc(4). As a reduction method, for example, it is conceivable to replace
the hardware with hardware having double the communication performance.
In that case, calculations as set forth below are carried out.

First, the following calculations are carried out from the expression
(3).

$$\tau_1(4) = 53 + 15 + 3/2 + 12 + 3/2 + 20 = 103$$

$$\tau_2(4) = 2 + 30 + 5/2 + 10 + 4 + 2/2 + 15 + 3/2 = 66$$

$$\tau_3(4) = 4 + 20 + 7/2 + 20 + 2/2 + 30 + 3/2 = 80$$

$$\tau_4(4) = 6 + 20 + 5/2 + 20 + 2/2 + 30 + 3/2 = 81$$

The following calculations are carried out from the expressions (5),
(6-1), (6-2) and (7), respectively.

$$R_b(4) = \frac{103 + 66 + 80 + 81}{103 \times 4} = 0.8010$$

$$R_p(4) = \frac{73 + 55 + 70 + 70}{73 + 55 + 70 + 70} = 1.000$$

$$A_p(4) = \frac{1}{1 - 1.000} = \infty$$

$$R_{tc}(4) = \frac{0 + 2 + 4 + 6}{330} = 0.0364$$

$$R_c(4) = \frac{(6 + 10 + 12 + 10)/2}{330} = 0.0576$$

$$R_w(4) = \frac{27 + 4 + 0 + 0}{330} = 0.0939$$

With respect to the parallel efficiency, the following calculation
is carried out from the expression (4-4).

$$E_p(4) = 0.8010 \times \frac{1}{1.000} \times (1 - 0.0364 - 0.0576 - 0.0939) = 0.6505$$

Further, the following calculation is also carried out.

$$E_p(4) \cdot p = 0.8010 \times \frac{1}{1.000} \times (1 - 0.0364 - 0.0576 - 0.0939) \cdot 4 = 0.6505 \cdot 4 = 2.602$$

If the performance of communication is improved as described
above, and the load balance is changed to Rb(4) = 1, the following
calculation is carried out.

$$E_p(4) \cdot p = 1 \times \frac{1}{1.000} \times (1 - 0.0364 - 0.0576 - 0.0939) \cdot 4 = 0.8121 \cdot 4 = 3.248$$
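
The prediction above can be expressed as a small what-if calculation. This is a sketch under stated assumptions: the helper `efficiency` is hypothetical, Rp = 1 is assumed as in the example, and halving the communication time is taken to leave every other component (including waiting) unchanged, as the worked numbers do.

```python
def efficiency(chi_p, chi_tc, chi_c, chi_w):
    """Return (Rb, Ep) for per-processor time breakdowns, assuming Rp = 1."""
    tau = [sum(t) for t in zip(chi_p, chi_tc, chi_c, chi_w)]
    p = len(tau)
    r_b = sum(tau) / (max(tau) * p)     # load balance contribution ratio
    e_p = sum(chi_p) / (max(tau) * p)   # parallel efficiency
    return r_b, e_p

# Breakdown of the Fig. 52 measurement.
chi_p, chi_tc = [73, 55, 70, 70], [0, 2, 4, 6]
chi_c, chi_w = [6, 10, 12, 10], [27, 4, 0, 0]

# Hypothetical tuning: doubled communication performance halves chi_c.
faster_c = [c / 2 for c in chi_c]
r_b, e_p = efficiency(chi_p, chi_tc, faster_c, chi_w)
speedup = e_p * 4           # predicted Ep * p after the hardware change
balanced = e_p / r_b * 4    # additionally assuming Rb = 1
print(round(r_b, 4), round(e_p, 4), round(speedup, 3), round(balanced, 3))
```

The sketch gives 2.602 for the faster communication alone and 3.248 when the load balance is also made ideal, matching the figures in the text.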

As described in "E. tuning processing", in this embodiment, it
is possible to predict the parallel performance at the time when the
respective parallel performance impediment factors are tuned and
improved. In conventional tuning, since the target value is a
processing time, there is a case where an impossible target value is
set. On the other hand, in this embodiment, a reasonable target setting
becomes possible by using Ep. Further, in this embodiment, since the
parallel efficiency and the like can be calculated from one measurement
result, it is possible to shorten the performance evaluation time at the
tuning. Further, in the conventional tuning, when input data or a
processing function is changed, the processing times measured up to that
time with respect to the respective parallel performance impediment
factors can not be used for the performance evaluation. Accordingly,
independent parallel performance evaluation has been performed for
respective input data or processing functions. In this embodiment, all
of the performance evaluation indexes have the form of a ratio, and
the parallel performances for different input data or processing
functions can be compared.
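
Because the indexes are ratios, the tuning order discussed for Fig. 53 can be derived mechanically. The following sketch is illustrative: the function `rank_losses` is hypothetical, and treating the load balance loss as 1 - Rb so that it can be compared directly with the impediment ratios is an assumption, not a rule stated in the text.

```python
def rank_losses(r_b, factors):
    """Sort efficiency losses in descending order.

    The load balance loss is taken as 1 - Rb (an assumption made here so
    that it is comparable with the impediment factor ratios).
    """
    losses = {"Rb": 1.0 - r_b, **factors}
    return sorted(losses.items(), key=lambda kv: kv[1], reverse=True)

# Index values collected for Fig. 53.
fig53 = rank_losses(0.8231, {"Rc": 0.1089, "Rw": 0.0888, "Rtc": 0.0344})
for name, loss in fig53:
    print(f"{name}: {loss:.4f}")
```

Under this assumption the ranking comes out as Rb, Rc, Rw, Rtc, the same order in which room for improvement is considered above.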

(8) Grid in homo-structure, etc. (a case where a master-slave processing
is performed in control parallel)

In the control parallel processing, procedures of the respective
processors are normally different from each other. In the case of a
master-slave processing, one processor becomes a master to control the
other processors, and plural processors execute processings in
accordance with its instructions. Here, the parallel performance in the
case where the processor #1 is the master processor will be evaluated.
It is assumed that a measurement result of elapsed time as shown in
Fig. 54 is obtained. In addition, χi,c = 0 is assumed.

The following calculations are carried out from the expression (3).

$$\tau_1(4) = 2 + 12 + 2 + 5 + 2 + 3 + 2 + 5 + 2 + 5 + 2 + 43 + 6 = 91$$

$$\tau_2(4) = 2 + 2 + 10 + 2 + 5 + 2 + 17 + 2 = 42$$

$$\tau_3(4) = 4 + 2 + 20 + 2 + 5 + 2 + 50 + 3 = 88$$

$$\tau_4(4) = 6 + 2 + 80 + 3 = 91$$

The following calculations are carried out from the expressions (5),
(6-1), (6-2) and (7), respectively.

$$R_b(4) = \frac{91 + 42 + 88 + 91}{91 \times 4} = 0.8571$$

$$R_p(4) = \frac{10 + 27 + 70 + 80}{10 + 27 + 70 + 80} = 1.000$$

$$A_p(4) = \frac{1}{1 - 1.000} = \infty$$

$$R_{tc}(4) = \frac{0 + 2 + 4 + 6}{312} = 0.0385$$

$$R_c(4) = \frac{18 + 8 + 9 + 5}{312} = 0.1282$$

$$R_w(4) = \frac{63 + 5 + 5 + 0}{312} = 0.2340$$

With respect to the parallel efficiency, the following calculations
are carried out from the expressions (4-4), (4-5), (8-2), (9-1) and
(9-2) in sequence.



$$E_p(4) = 0.8571 \times \frac{1}{1.000} \times (1 - 0.0385 - 0.1282 - 0.2340) = 0.5137$$

$$E_p(4) = 0.8571 \times \frac{1}{1 - 1/\infty} \times (1 - 0.0385 - 0.1282 - 0.2340) = 0.5137$$

$$E_p(4) = 0.8571 \times (1 - 0.0385 - 0.1282 - 0.2340) = 0.5137$$

$$E_p(4) = 0.8571 \times \frac{1}{1.000} \times \frac{10 + 27 + 70 + 80}{312} = 0.5137$$

$$E_p(4) = \frac{(10 + 27 + 70 + 80) + 0}{91 \times 4} = 0.5137$$



The above calculated parallel performance evaluation indexes are
collected as shown in Fig. 55. Since Ap(4) = ∞, the performance
improvement at the time when the parallel processing is performed by
p = ∞ processors is infinite; however, an actual performance improvement
Ep·p at the time when p = 4 is applied is 2.055. The reason is that the
parallel efficiency Ep becomes 86% (Rb(4) = 0.8571) by the load balance
contribution ratio, and is further lowered by 23% (Rw(4) = 0.2340) by
waiting, and by 17% (Rtc(4) + Rc(4) = 0.0385 + 0.1282) in total by task
generation and communication. In the case where the master-slave
processing is performed, it is known that the waiting time of the master
processor has an important influence on the whole processing. In this
embodiment, the influence of the waiting on the performance is
quantitatively grasped, and it is possible to judge whether the
master-slave processing is effectively performed.
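
How much of the waiting is attributable to the master processor can be read directly from the measured times. The sketch below is illustrative only: `master_share` and `master_idle` are hypothetical derived quantities introduced here and are not indexes defined in this embodiment; only Rw(4) corresponds to a value in the text.

```python
# Per-processor values read from the worked example (Fig. 54).
chi_w = [63, 5, 5, 0]     # waiting time; processor #1 is the master
tau   = [91, 42, 88, 91]  # elapsed time

total = sum(tau)
r_w = sum(chi_w) / total              # overall waiting ratio Rw(4)
master_share = chi_w[0] / sum(chi_w)  # part of all waiting that occurs on the master
master_idle = chi_w[0] / tau[0]       # fraction of the master's own time spent waiting
print(round(r_w, 4), round(master_share, 3), round(master_idle, 3))
```

For this measurement the master accounts for most of the waiting time, which is consistent with Rw(4) being the largest impediment factor in Fig. 55.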

(9) Grid in homo-structure, etc. (case of mixture of data parallel and
control parallel)

As for a processing in which the data parallel and the control
parallel are mixed, since it is difficult to keep the load balance,
the processing is not used in normal operations. In this embodiment,
it becomes possible to make a parallel performance evaluation in such
a case as well. Since this embodiment provides the performance
evaluation indexes for control of the processing, it provides a
practical evaluation method for such a processing. Here, the parallel
performance is evaluated in the case where processors #1 to #4
are operated in the control parallel, processors #5 to #8 are operated
in the data parallel, and the processor #1 is made the master processor.
It is assumed that a measurement result of elapsed time as shown in
Fig. 56 is obtained. In addition, χi,c = 0 is assumed.

The following calculations are carried out from the expression (3).

$$\tau_1(8) = 2 + 12 + 2 + 5 + 2 + 3 + 2 + 5 + 2 + 5 + 2 + 43 + 6 = 91$$

$$\tau_2(8) = 2 + 2 + 10 + 2 + 5 + 2 + 17 + 2 = 42$$

$$\tau_3(8) = 4 + 2 + 20 + 2 + 5 + 2 + 50 + 3 = 88$$

$$\tau_4(8) = 6 + 2 + 80 + 3 = 91$$

$$\tau_5(8) = \tau_6(8) = \tau_7(8) = \tau_8(8) = 2 + 2 + 30 + 3 + 10 + 2 + 40 + 2 = 91$$

The following calculations are carried out from the expressions (5),
(12-1), (6-1), (6-2) and (7), respectively.

$$R_b(8) = \frac{91 + 42 + 88 + 91 + 91 + 91 + 91 + 91}{91 \times 8} = \frac{676}{728} = 0.9286$$

$$R_p(8) = \frac{10 + 27 + 70 + 80 + 70 + 70 + 70 + 70}{(10 + 27 + 70 + 80 + 70 + 70 + 70 + 70) + 10} = \frac{467}{477} = 0.9790$$

$$A_p(8) = \frac{1}{1 - 0.9790} = 47.62$$

$$R_{red}(8) = \frac{\cdots}{676} = 0.0592$$

$$R_{tc}(8) = \frac{0 + 2 + 4 + 6 + 2 + 2 + 2 + 2}{676} = 0.0296$$

$$R_c(8) = \frac{18 + 8 + 9 + 5 + 9 + 9 + 9 + 9}{676} = 0.1124$$

$$R_w(8) = \frac{63 + 5 + 5 + 0 + 0 + 0 + 0 + 0}{676} = 0.1080$$

With respect to the parallel efficiency, the following calculations
are carried out from the expressions (4-4), (4-5), (8-2),
(9-1) and (9-2) in sequence.

$$E_p(8) = 0.9286 \times \frac{1}{0.9790} \times (1 - 0.0592 - 0.0296 - 0.1124 - 0.1080) = 0.6552$$

$$E_p(8) = 0.9286 \times \frac{1}{1 - 1/47.62} \times (1 - 0.0592 - 0.0296 - 0.1124 - 0.1080) = 0.6552$$

$$E_p(8) = 0.9286 \times (1 - 0.0592 - 0.0296 - 0.1124 - 0.1080) = 0.6552$$

$$E_p(8) = 0.9286 \times \frac{1}{0.9790} \times \frac{467}{676} = 0.6553$$

$$E_p(8) = \frac{467 + 10}{91 \times 8} = 0.6552$$

The above calculated parallel performance evaluation indexes are
collected as shown in Fig. 57. Since Ap(8) = 47.62, the parallel
processing should be performed at p < 47. An actual performance
improvement Ep·p at the time when the number of processors, p = 8, is
applied is 5.242. The reason is that the parallel efficiency Ep becomes
93% (Rb(8) = 0.9286) by the load balance contribution ratio, and is
further lowered by 11% (Rw(8) = 0.1080) by waiting, by 11% (Rc(8) =
0.1124) by communication, and by 9% (Rred(8) + Rtc(8) = 0.0592 + 0.0296)
in total by redundancy processing and task creation. As stated above,
this embodiment can be applied to the parallel processing system of the
mixture of the data parallel and the control parallel.
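
The acceleration ratio that bounds the useful number of processors is sensitive to the rounding of Rp. The sketch below illustrates this for the Fig. 56 numbers; the variable names are hypothetical, and the value 10 for the sequential term is simply the constant that appears in the Rp(8) denominator above.

```python
# Parallel processing time per processor (Fig. 56).
chi_p = [10, 27, 70, 80, 70, 70, 70, 70]
chi_seq = 10   # sequential term appearing in the Rp(8) denominator

r_p = sum(chi_p) / (sum(chi_p) + chi_seq)
a_p_rounded = 1 / (1 - round(r_p, 4))  # as in the text: Rp rounded to 0.9790 first
a_p_exact = 1 / (1 - r_p)              # without intermediate rounding
print(round(r_p, 4), round(a_p_rounded, 2), round(a_p_exact, 2))
```

With Rp rounded to four digits the sketch reproduces Ap(8) = 47.62, while the unrounded ratio gives about 47.70; either way the limit on the processor count is on the order of 47.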

(10) Case where there is a redundancy processing in hetero-structure
of grid, etc. (χi,red ≠ 0)

In many cases, processors connected through the grid or the
cluster are different in CPU capability. This is called a
hetero-structure. This embodiment can also be applied to the case of
the hetero-structure. Here, in example (2), the parallel performance
in the case where the processor #1 has half performance will be evaluated.
It is assumed that a measurement result of elapsed time as shown in
Fig. 58 is obtained.

The following calculations are carried out from the expression (3).

$$\tau_1(4) = 16 + 70 + 20 + 40 + 10 = 156$$

$$\tau_2(4) = 10 + 33 + 11 + 22 + 7 = 83$$

$$\tau_3(4) = 7 + 37 + 10 + 19 + 4 = 77$$

$$\tau_4(4) = 11 + 30 + 9 + 18 + 6 = 74$$

The following calculations are carried out from the expressions (5),
(12-1), (6-1), (6-2) and (7), respectively.

$$R_b(4) = \frac{156 + 83 + 77 + 74}{156 \times 4} = \frac{390}{624} = 0.6250$$

$$\chi_{1,c}(1) = \frac{1}{4} \cdot (46 + 28 + 21 + 26) = 30.25$$

$$R_p(4) = \frac{110 + 55 + 56 + 48}{(110 + 55 + 56 + 48) + 30.25} = \frac{269}{299.3} = 0.8988$$

$$A_p(4) = \frac{1}{1 - 0.8988} = 9.881$$

$$R_{red}(4) = \frac{46 + 28 + 21 + 26}{156 + 83 + 77 + 74} = \frac{121}{390} = 0.3103$$

With respect to the parallel efficiency, the following calculations
are carried out from the expressions (4-4), (4-5), (9-1) and (9-2) in
sequence.

$$E_p(4) = 0.6250 \times \frac{1}{0.8988} \times (1 - 0.3103) = 0.4796$$

$$E_p(4) = 0.6250 \times \frac{1}{1 - 1/9.881} \times (1 - 0.3103) = 0.4796$$

$$E_p(4) = 0.6250 \times \frac{1}{0.8988} \times \frac{269}{390} = 0.4796$$

$$E_p(4) = \frac{269 + 30.25}{156 \times 4} = 0.4796$$

The above calculated parallel performance evaluation indexes are
collected as shown in Fig. 59. Since Ap(4) = 9.881, a parallel processing
at p > 9 is meaningless. An actual performance improvement Ep·p at the
time when the number of processors, p = 4, is applied is 1.918. The
reason is that the parallel efficiency Ep becomes 63% (Rb(4) = 0.6250)
by the load balance contribution ratio, and is further lowered by 31%
(Rred(4) = 0.3103) by redundancy processing. As compared with Fig. 41,
it is understood that the load balance contribution ratio Rb(4) is
lowered from 0.9398 to 0.6250. The difference of the processor #1 as
shown in Fig. 41 and Fig. 59 is reflected in the performance evaluation
index Rb(4), and this result is caused. In general, when equally divided
tasks are processed by processors different in CPU capability, the load
balance is lost. In this embodiment, this can be detected by Rb(4).
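
The detection of a lost load balance in the hetero-structure can be sketched as follows. This is an illustrative calculation, not the apparatus itself: the variable names are hypothetical, and estimating the sequential term as the average redundancy time follows the 30.25 step of the worked example.

```python
# Per-processor values read from the worked example (Fig. 58).
chi_p   = [110, 55, 56, 48]  # parallel processing time
chi_red = [46, 28, 21, 26]   # redundancy processing time

tau = [a + b for a, b in zip(chi_p, chi_red)]  # elapsed time per processor
p = len(tau)

chi_seq = sum(chi_red) / p            # average redundancy, as in the 30.25 step
r_b = sum(tau) / (max(tau) * p)       # load balance contribution ratio
r_red = sum(chi_red) / sum(tau)       # redundancy ratio
e_p = (sum(chi_p) + chi_seq) / (max(tau) * p)  # parallel efficiency
slowest = tau.index(max(tau)) + 1     # 1-based number of the slowest processor
print(chi_seq, round(r_b, 4), round(r_red, 4), round(e_p, 4), slowest)
```

The low Rb(4) of 0.6250 together with the slowest processor being #1 points to the half-performance processor as the cause of the imbalance.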

Although the present invention has been described with respect
to a specific preferred embodiment thereof, various changes and
modifications may be suggested to one skilled in the art, and it is
intended that the present invention encompass such changes and
modifications as fall within the scope of the appended claims.